Abstract

The  Department  of  Defense  (DoD)  relies  heavily  on  information  systems  to 
complete  a  myriad  of  tasks,  from  day-to-day  personnel  actions  to  mission  critical  imagery 
retrieval,  intelligence  analysis,  and  mission  planning.  The  astronomical  growth  in  size 
and  performance  of  data  storage  systems  leads  to  problems  in  processing  the  amount  of 
data  returned  on  any  given  query.  Typical  relational  database  systems  return  a  set  of 
unordered  records.  This  approach  is  acceptable  in  small  information  systems,  but  in  large 
systems,  such  as  military  image  retrieval  systems  with  more  than  1  million  records,  it 
requires  considerable  time  (often  hours  to  days)  to  sort  through  thousands  of  records  and 
select the relevant ones for analysis.

This  research  introduces  Intelligent  Query  Answering  (IQA)  as  a  novel  approach 
to  information  retrieval.  IQA  implements  the  FOIL  algorithm  to  learn  rules  based  upon 
user  feedback  [QUI90].  The  Winnow  algorithm  adjusts  rule  weights  based  on  user 
classification,  for  improved  document  orderings  [BLU97].  A  semantic  tree  specific  to  the 
domain  allows  rule  generalization  across  the  domain. 

Testing  shows  a  document  sort  accuracy  rate  of  63-93%  against  a  controlled  test 
dataset  and  78-89%  accuracy  rate  on  a  subset  of  declassified  National  Air  Intelligence 
Center  imagery  metadata.  These  results  demonstrate  that  this  research  provides 
groundwork  for  future  efforts  in  rule  learning  and  rule  generalization  in  the  information 
retrieval  field. 
INTELLIGENT QUERY ANSWERING
THROUGH  RULE  LEARNING  AND  GENERALIZATION 


1.  Introduction 


1.1  Purpose 

In  today’s  world  most  organizations  rely  heavily  on  information  and  information 
technology  to  conduct  day-to-day  activities.  Recent  events  in  the  war  against  terrorism 
illustrate  the  critical  need  for  real-time,  accurate  intelligence  information.  The  ability  of 
the  Department  of  Defense  and  the  Air  Force  to  accomplish  their  mission  relies  heavily 
on  the  ability  to  process  a  tremendous  amount  of  data,  both  text  and  imagery  for 
intelligence  analysis. 

Over  the  years,  millions  of  records  have  been  collected,  cataloged,  digitized,  and 
stored  in  large  databases.  Data  storage  systems  are  continually  expanding  to  meet  the 
ever-increasing  demand  for  more  capacity.  It  is  common  to  find  a  personal  computer  with 
40-120 gigabytes of hard disk storage. Large computer systems measure storage in terms
of terabytes (1 terabyte = 1,024 gigabytes), and systems are now even entering the
2-petabyte range of capacity (1 petabyte = 1,024 terabytes) [XIN03].

As  storage  capacity  increases,  the  computational  cost  of  manipulating  that 
information  also  increases.  The  available  information  is  overwhelming  to  even  the  most 
accomplished  information  processing  organizations.  This  problem  becomes  more 
pronounced  in  systems  with  heterogeneous  data  collections.  Returning  a  set  of  hundreds 
or  thousands  of  unordered  records  dramatically  increases  the  time  spent  sorting  through 
the data to find the desired information. To be an effective tool for users, computer
systems must have more sophisticated ways of returning relevant information to user
queries.

This research introduces Intelligent Query Answering (IQA) as a novel approach
to  information  retrieval.  IQA  uses  a  modified  version  of  Quinlan’s  FOIL  algorithm  to 
learn  rules  based  upon  user  search  terms  and  classification  of  returned  documents 
[QUI90].  The  Winnow  algorithm  adjusts  the  rule  weights  based  on  previous  user 
classifications,  improves  the  order  of  the  sorted  documents  returned,  and  the  process 
repeats  [BLU97].  A  semantic  tree  specific  to  the  domain  allows  rule  generalization.  This 
provides  users  with  documents  sorted  with  the  assistance  of  generalized  rules  where  none 
previously existed, and also generalizes similar sets of specialized rules.
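
As a rough illustration of the weight-adjustment step, the sketch below applies a Winnow-style multiplicative update to a set of rule weights. The rule names, the promotion factor of 2, and the decision threshold are assumptions chosen for the example; they are not taken from the IQA implementation.

```python
def score(weights, fired_rules):
    """Sum the weights of the rules that matched a document."""
    return sum(weights[rule] for rule in fired_rules)


def winnow_update(weights, fired_rules, predicted_relevant, user_says_relevant, alpha=2.0):
    """Return new rule weights after one user classification.

    Winnow changes weights only when the prediction was wrong: rules that
    fired are promoted (multiplied by alpha) after a missed relevant document
    and demoted (divided by alpha) after a false alarm.
    """
    new_weights = dict(weights)
    if predicted_relevant == user_says_relevant:
        return new_weights
    factor = alpha if user_says_relevant else 1.0 / alpha
    for rule in fired_rules:
        new_weights[rule] *= factor
    return new_weights


# Hypothetical rules r1..r3, all starting at weight 1.0.
weights = {"r1": 1.0, "r2": 1.0, "r3": 1.0}
threshold = 1.5  # assumed decision threshold
fired = {"r1"}
predicted = score(weights, fired) >= threshold
weights = winnow_update(weights, fired, predicted, user_says_relevant=True)
```

Because only the rules that fired on a misclassified document change, repeated feedback gradually separates useful rules from unhelpful ones.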

1.2 Background

Several  research  efforts  at  the  Air  Force  Institute  of  Technology  have  focused  on 
improving  user  access  to  relevant  information  with  the  National  Air  Intelligence  Center’s 
(NAIC) Imagery Exploitation Capability (IEC) System. The IEC System is an
operational  system  in  need  of  improvement.  Some  of  these  efforts  have  included 
improving  the  methods  of  returning  relevant  information  by  using  multi-modal  feedback 
[WIE03]  and  by  improving  the  graphical  user  interface  [BAC03].  These  efforts  have 
made  significant  strides  in  ordering  records  returned  by  relevance  in  user  queries,  but 
require an extensive number of queries on the IEC System to build effective structures
that  improve  search  results  and  overall  query  performance.  The  need  to  develop  better 
methods  for  providing  relevant  information  faster  is  clear. 
1.2.1 NAIC IEC System Background

The NAIC uses the IEC System to store and retrieve images used for intelligence
analysis and planning. This system has been in use for four years and consists of more than
1.3 million images with associated metadata [BAC03]. It consists of a database and an
image library. The database stores the metadata for each image along with a hyperlink. The
metadata  in  each  record  describes  the  image  while  the  hyperlink  points  at  the  respective 
image  in  the  image  library.  The  goal  of  the  system  is  the  retrieval  of  military  images  in 
support of timely intelligence analysis. Although a relatively new software product, IEC
has an extraordinarily slow response time (minutes) and returns unordered sets of
records, 5 records at a time. Most of a researcher’s time is spent waiting for IEC
responses. 

1.2.2 IEC Operations

The NAIC employs more than 700 personnel who use the IEC system. A person
assumes one of four specific roles using this system: photographer, commenter,
researcher, or analyst [BAK03, DIA03]. Photographers are responsible for acquiring
imagery. Commenters digitize the imagery and store it in the IEC System. They also
add  comments  (metadata)  to  the  system  that  describe  an  image.  The  images  are  stored  in 
the  imagery  library  and  the  metadata  is  stored  in  the  relational  database.  Researchers 
receive  requests  for  specific  image  content  and  search  the  system  for  images  that  assist 
the requesting analyst. One or more query terms are used to search for relevant images,
much like one would use an Internet search engine. A researcher makes note of any
relevant images and passes that information to an analyst. Analysts review these images
and provide analysis for the intelligence community.


1.2.3 IEC Issues

The IEC system is an enormous relational database. Each record contains a link to
the respective image it represents in the image library. It responds to a researcher’s query
by providing a complete, unordered list of records containing only documents that
include all query search terms in the metadata. Furthermore, these queries cannot be
Boolean.

Boolean searches use the logical operators and, or and not. The Boolean and
means that all the terms specified must appear in the document(s). The Boolean or means
that at least one of the terms specified must appear in the document(s). The Boolean not
means that the specified term must not appear in the document(s).
Combinations of these operators can provide an effective return of documents, albeit without
regard to relevance.

Since the IEC does not have Boolean search capability, a user may not search by
and-ing, or-ing, or not-ing terms together to increase the effective return of records. All
query terms must have a matching term in each record’s metadata (effectively all
terms and-ed together) for the IEC to return the record. Additionally, term order has no
relevance in the IEC.

IEC returns five records at a time and the time delay for the appearance of the first
set of records is usually greater than 30 seconds and often as long as eight minutes
[DIA03]. Within each record returned is also a hyperlink for the image associated with
the  metadata.  In  order  for  the  researcher  to  view  an  image,  they  must  click  on  this 
hyperlink  and  retrieve  the  image.  The  time  delay  between  the  researcher  selecting  the 
image and the image appearing can be as long as two minutes [DIA03]. Changing
from one set of five records to the next takes from two to five minutes
[DIA03]. This delay occurs each time the researcher requests a new set of five records.
The IEC with its 1.3 million images frequently returns hundreds of unordered records.
Occasionally a query results in more than a thousand records returned. This makes the
task of finding relevant images tedious and time-consuming, with individual searches
taking  hours  or  days  to  complete.  Given  the  number  of  records  routinely  returned,  there  is 
a  substantial  possibility  that  the  researcher  will  never  see  records  deep  in  the  returned  list. 

Other approaches using modern information retrieval methods to improve the IEC
system capabilities have been studied. These approaches have been somewhat successful,
but the basis for this research is the exploration of an alternative method of returning
relevant records using machine learning techniques. The IEC provides a useful source of
data  for  study.  Section  2.2  presents  an  overview  of  some  information  retrieval  methods  to 
provide  a  contrast  for  the  basis  of  this  research. 

1.3 Research Focus

The  primary  focus  of  this  research  is  the  introduction  and  exploration  of  a  new 
method  of  information  retrieval  that  blends  rule  learning  through  user  search  and 
document classification with rule generalization. Learning rules through user
classification provides the basis for returning records sorted by relevance. Generalizing
those learned rules across a predefined semantic tree provides a “best guess” return of
relevant  documents  based  upon  existing  rules.  The  goal  is  a  system  that  rapidly  learns 
how  a  user  queries  a  database,  and  then  uses  those  rules  to  return  the  most  relevant 
documents. 

1.3.1  Objectives 

This  research  has  two  objectives.  The  first  objective  is  the  identification  and 
implementation  of  an  effective  rule  learning  system,  including  user  feedback  and 
relevance  assignment.  Rule  learning  and  rule  weight  adjusting  add  relevance  to  each 
document.  This  provides  a  method  of  returning  documents  in  order  of  relevance.  The 
second  objective  is  defining  a  data  structure  to  represent  the  semantic  relationship  of  a 
dataset. This structure would support term generalization and allow for interrogation
of that data structure. WordNet [MIL90] provides some ideas for generalizing terms.
Rule generalization adds additional relevance to documents and improves relevance
order. It also reduces computation time by combining two or more specialized rules into
a more general one.
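
To make the second objective concrete, the following minimal sketch stores a semantic tree as child-to-parent links and generalizes two terms to their most specific shared ancestor. The tree contents and the function names are hypothetical and serve only to illustrate the idea; the structure actually used in this research is described in later chapters.

```python
# Hypothetical child -> parent links; a real hierarchy would be built from
# the domain's metadata terms.
PARENT = {
    "mig-21": "fighter",
    "mig-29": "fighter",
    "fighter": "aircraft",
    "tu-95": "bomber",
    "bomber": "aircraft",
    "aircraft": None,
}


def ancestors(term):
    """Return the path from a term up to the root, starting with the term itself."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def generalize(term_a, term_b):
    """Return the most specific term that lies on both terms' paths to the root."""
    seen = set(ancestors(term_a))
    for candidate in ancestors(term_b):
        if candidate in seen:
            return candidate
    return None


shared = generalize("mig-21", "mig-29")
```

Two specialized rules that differ only in the terms "mig-21" and "mig-29" could then be replaced by one rule over the shared ancestor.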

1.3.2  Approach 

This  research  approach  begins  with  a  review  of  published  literature  on 
information  retrieval,  rule  learning  and  lexicographical  dictionaries.  It  continues  with  the 
selection  and  implementation  of  a  suitable  rule-learning  method.  An  electronic 
lexicographical  dictionary  guides  the  building  of  a  generalization  framework.  Nouns  and 
adjectives from the IEC System’s metadata form the generalization hierarchy. This
hierarchy provides the foundation for rule generalization. Experiments generate and
generalize  rules  through  user  queries.  An  analysis  of  rules  learned  and  document  return 
order  determines  the  effectiveness  of  the  combined  methodologies. 

1.4 Summary

The  primary  focus  of  this  research  is  the  introduction  and  exploration  of  a  new 
method  for  information  retrieval.  This  research  presents  and  implements  a  methodology 
for  blending  rule  learning  with  rule  generalization  for  improved  query  results.  Test  and 
result analyses validate the approach and provide a way to quantify its successes. This
research uses test data and the de-classified subset of metadata from the IEC System.

The  next  four  chapters  present  the  research  and  results  of  this  thesis.  Chapter  2 
provides  an  overview  of  information  retrieval,  rule  learning  methodologies  and  the 
WordNet  lexical  dictionary.  Chapter  3  discusses  the  methodology  for  implementing  this 
rule  learning  and  rule  generalization  system.  Chapter  4  presents  testing  and  analysis  of 
test  results.  Chapter  5  concludes  the  research  with  conclusions  and  recommendations  for 
future  work. 
2. Background


2.1  Introduction 

This chapter provides background information useful for establishing a
foundation for this research effort. It provides a brief discussion on information retrieval
methods, rule learning and lexical reference systems.

2.2  Information  Retrieval  Methods 

Information  retrieval  (IR)  methods  include  systems  for  indexing,  searching  and 
recalling  data,  particularly  text  or  other  unstructured  forms.  While  there  are  a  number  of 
methods,  the  three  most  widely  used  and  well  known  are  the  Boolean,  probabilistic,  and 
vector  methods. 

2.2.1 Boolean Method

The Boolean method uses a set of keywords associated with each record within a
system. These keywords are the index terms. Users type in one or more of these index
terms to retrieve records that match these terms. The Boolean operators are and, or and
not. Mixing two or more terms with one or more Boolean operators refines the search,
and can reduce or increase the number of records returned. The combination of these
terms  is  a  search  query,  and  the  Boolean  retrieval  system  returns  records  based  on  these 
queries. 
Let $X_n$ represent a term in a query. In the query [(($X_1$ and $X_2$) or ($X_3$ and $X_4$)) and
not $X_5$], retrieved records must contain the term pairs $X_1$ and $X_2$, or $X_3$ and $X_4$, or both.
However, none of the records can contain $X_5$.
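
The query above can be checked mechanically. The sketch below treats each record as a set of index terms and evaluates the expression against it; the record contents and the lower-case term names are invented for illustration.

```python
def matches(record_terms):
    """Evaluate ((X1 and X2) or (X3 and X4)) and not X5 for one record."""
    def has(term):
        return term in record_terms

    pair_one = has("x1") and has("x2")
    pair_two = has("x3") and has("x4")
    return (pair_one or pair_two) and not has("x5")


# Hypothetical records, each reduced to its set of index terms.
records = {
    "a": {"x1", "x2"},        # first pair present
    "b": {"x1", "x3", "x4"},  # second pair present
    "c": {"x1", "x2", "x5"},  # first pair present, but x5 excludes it
    "d": {"x1", "x3"},        # neither pair complete
}
retrieved = sorted(name for name, terms in records.items() if matches(terms))
```

Every retrieved record satisfies the query equally, so the result carries no ordering by relevance.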

While  the  Boolean  method  is  widely  used,  it  has  limitations  and  disadvantages. 
One  of  the  primary  disadvantages  is  that  many  users  have  no  understanding  of  Boolean 
logic.  This  hinders  their  capability  for  building  effective  queries.  Furthermore,  Boolean 
logic is quite unyielding in a retrieval system when using or-ed only or and-ed only
terms. The presence of any one of the terms $X_n$ in a record satisfies the query ($X_1$ or $X_2$ or $X_3$ or $X_4$ or
$X_5$) and returns that record. Conversely, the absence of only one of the terms $X_n$ from a record
rejects that record for the query ($X_1$ and $X_2$ and $X_3$ and $X_4$ and $X_5$) [SLA91].

Even a cogent Boolean search string is limited by the order of returned records.
Boolean  retrieval  methods  on  large  information  systems  can  return  huge  sets  of 
unordered  or  poorly  ordered  records.  Since  it  is  now  much  easier  to  store  vast  amounts  of 
information,  users  must  have  the  ability  to  retrieve  desired  records  efficiently.  Finding 
capable  methods  of  quickly  returning  the  most  relevant  information  to  users  is  a  priority 
for  many  in  computational  research  arenas. 

2.2.2 Probabilistic Method

Maron and Kuhns first presented the probabilistic approach to information
retrieval (IR) [MAR60]. The probabilistic approach seeks to answer the question
[JON98]:

“What  is  the  probability  that  this  document  is  relevant  to  this  query?” 
The answer to this question begins with an ordered document set from the entire
document collection. The problem is that a user does not know what this set should look like
unless  they  inspect  each  document.  Therefore,  the  probabilistic  model  provides  an  initial 
starting  point  and  adjusts  relevance  through  user  feedback  compiled  over  several  searches 
[BAZ99].  Estimating  a  starting  point  can  be  computationally  inefficient  [CRE98]. 

2.2.3 Vector Method

The Boolean model assumes that all index terms have an equal weight. IR vector-based
systems add a numeric weight to each term, expanding the computational
possibilities. This improves the system through the application of a variety of
probabilistic methods. Such systems are known as relevance feedback systems. In a
relevance  feedback  system,  the  terms  in  each  document  have  relevance  weights.  A  query 
combined  with  a  set  of  documents  creates  a  new  and  presumably  more  useful  query 
[ALL95]. 

Text  categorization  is  the  process  of  assigning  term  relevance  and  frequently  uses 
two  approaches.  Each  approach  makes  use  of  a  bag-of-words  representation  that  looks  at 
documents  as  bags-of-words  without  considering  word  order.  Each  approach  assigns  a 
value  to  a  set  of  attributes,  sometimes  called  features,  based  on  the  function  of  the 
respective  approach.  In  both  approaches,  each  distinct  word  is  a  feature  and  the  number 
of times the word occurs within a document determines its value. Since there is no
consideration  of  word  order,  some  information  is  lost  with  this  representation  [JOA97]. 

One such method is the Term Frequency, Inverse Document Frequency (TFIDF)
approach [JOA97]. This method represents each document as a vector with weights based
on TFIDF. Documents with similar content have similar vectors. There is a direct
correlation between the angle between two vectors and the number of matching terms. The
smaller the angle between two document vectors, the more similar the documents.

TFIDF calculates a vector for a user query and compares the user query vector
with all the document vectors, returning an ordered list of documents. TFIDF ranks each
document vector with respect to the query vector, using the angle between the two for
determining the rank. The smallest angle receives the highest rank. While this method
provides more effective retrievals, it also substantially increases computational effort
[SLA91].
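
The ranking step can be sketched in a few lines. The example below builds bag-of-words TFIDF vectors and orders documents by the cosine of the angle to the query vector, so the smallest angle ranks first. The smoothed weighting formula (log(N/df) + 1) and the sample documents are assumptions for illustration; many TFIDF variants exist and the cited systems may use a different one.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TFIDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Assumed weighting: the +1.0 keeps terms found in every document non-zero.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return document indices ordered from smallest to largest angle to the query."""
    vectors, idf = tfidf_vectors(docs)
    q_vec = {t: tf * idf[t] for t, tf in Counter(query).items() if t in idf}
    scored = [(cosine(q_vec, vec), index) for index, vec in enumerate(vectors)]
    return [index for _, index in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


docs = [
    "left wheel well rear section".split(),
    "right wheel well rear section".split(),
    "cockpit instrument panel".split(),
]
order = rank("left wheel well".split(), docs)
```

Every document vector is compared with the query vector on each search, which is the source of the additional computational effort noted above.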

Another method uses one of the many Naive Bayes classifying algorithms. These
classifying algorithms use Bayes’ rule to simplify computations by assuming all term
classes are independent. The classifier determines which class or classes the document
belongs in. The algorithm assigns documents to one or more classes and sorts them. User
queries can then quickly retrieve classified documents.

Other methods explore a variety of document ranking techniques, such as
considering passages derived from complete documents [WIL94], or from clusters of
paragraphs, or from arbitrary lengths of long strings of related sentences [HEA93]. In
addition, probabilistic methods applied to searches using TFIDF extract better results
than with TFIDF alone [JOA97]. However, most of these methods still rely on structured
analysis of the documents’ terms and the query, and do not gain knowledge from a user
classifying the results.
2.3 Rule Learning

Rule learning is different from IR methods; concepts in the form of rules are
stored and used for future searches. A difficult aspect of any sort of machine learning is
the presentation of results in human readable form. If-then rules provide one of the most
expressive and understandable forms of knowledge representation [MIT97]. Rule
learning algorithms vary in the way they search a training data set and in the way they
generalize. They also differ in the way they represent class descriptions as well as how
they cope with errors and noise in the training data. The sections following this overview
introduce strategies and discuss a number of methods of learning rules in this form.

Concept learning systems are differentiated by the complexity of the input and
output languages they use. Learning systems that use propositional approaches lie at one
extreme of complexity, logical inference systems at the other. The former lends itself to
more simplistic representation, using conjunctions (and) and disjunctions (or) of
proposition terms. This simplicity makes such systems suitable for large volumes of data.
They represent concepts as collections of examples and counter examples, and thus can
exploit the statistical properties of these collections [QUI90]. The latter accepts
descriptions of complex, structured entities, and generates classification rules and
expresses them in first-order logic.

2.3.1 Rule Learning Strategy

Learning a concept is achieved through one of two methods: simultaneous
covering algorithms and sequential covering algorithms [ART99]. The first category
includes decision trees, while the second includes direct rule learning algorithms.
2.3.1.1 Simultaneous Covering Algorithms

Decision  tree  algorithms  learn  the  entire  set  of  disjunctions  simultaneously  as  part 
of  one  search  through  a  selected  decision  tree  [MIT97].  These  algorithms  use  a  strategy 
of  overfit-and-simplify  [ART99].  The  algorithms  prune  results  after  a  search  to  reduce 
the generated rule set. ID3 [QUI86] and C4.5 [QUI93] are examples of simultaneous
covering  algorithms. 

ID3 uses a hill-climbing approach to find a locally optimal solution using a
greedy technique. This technique branches the decision tree and selects the feature that
provides the highest information gain. This information gain reduces the expected entropy
of a decision tree [COH92]. C4.5 is an extension of ID3 that addresses some basic
problems in ID3 such as the overfitting of noisy data. [MIT97] defines overfitting:

“Given a hypothesis space $H$, a hypothesis $h \in H$ is said to overfit the
training data if there exists some alternative hypothesis $h' \in H$, such that $h$
has a smaller error than $h'$ over the training examples, but $h'$ has a
smaller error than $h$ over the entire distribution of instances.”

Additional C4.5 extensions include the incorporation of numerical attributes and discrete
values of a single attribute grouped together to support more complex tests. C4.5 also
accepts missing attribute values, and increases accuracy by post-pruning rules after the
tree induction.
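
The attribute-selection criterion that ID3 applies at each branch can be illustrated with a short calculation. The sketch below computes entropy and information gain for a toy data set and picks the attribute with the highest gain; the attribute names, values, and labels are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Expected reduction in entropy from splitting the rows on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical training data: "view" separates the classes, "part" does not.
rows = [
    {"view": "left", "part": "wheel"},
    {"view": "left", "part": "wing"},
    {"view": "right", "part": "wheel"},
    {"view": "right", "part": "wing"},
]
labels = ["pos", "pos", "neg", "neg"]
best = max(["view", "part"], key=lambda attr: information_gain(rows, labels, attr))
```

Here the split on "view" removes all uncertainty about the class, so a greedy tree builder would branch on it first.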

2.3.1.2 Sequential Covering Algorithms

Sequential  Covering  Algorithms  learn  rules  one  at  a  time.  They  compute  the 
subsets  of  data  being  covered  or  the  subsets  representing  the  decision  class,  and  choose 
the  best  rule  among  alternative  attribute-value  pairs  [ART99].  This  class  of  algorithms  is 
generally split into two areas: general-to-specific searches and specific-to-general
searches. FOIL [QUI90], FOCL [PAZ97] and FOIDL [MOO95] are examples of
sequential covering algorithms.

2.3.1.2.1 General-to-Specific Searches

One  rule  learning  approach  learns  one  rule  at  a  time  and  organizes  the  search  in 
the  hypothesis  space  the  same  way  simultaneous  covering  algorithms  do,  but  follows 
only the most promising branch in the tree at each step [MIT97, ART99]. The search
begins by considering the most general condition possible, the empty test that matches
every instance. Next, the attribute test that most improves rule performance, measured
over the training samples, is added. This process continues, greedily adding new attribute
tests until the hypothesis reaches an acceptable level of performance. A single descendent
is  followed  at  each  step  whereas  simultaneous  covering  grows  a  subtree  that  covers  all 
possible  values  of  the  selected  attributes. 
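
The greedy general-to-specific search described here can be sketched as follows. The rule starts empty and gains one attribute test at a time, always the test that most improves accuracy on the training examples. Accuracy as the performance measure, the stopping rule, and the example data are assumptions made for this illustration.

```python
def covers(rule, example):
    """A rule maps attribute -> required value; the empty rule covers everything."""
    return all(example.get(attr) == value for attr, value in rule.items())


def accuracy(rule, examples):
    """Fraction of the examples covered by the rule that are positive."""
    covered = [label for example, label in examples if covers(rule, example)]
    return sum(covered) / len(covered) if covered else 0.0


def learn_one_rule(examples, target_accuracy=1.0):
    """Greedily specialize the empty rule until it is accurate enough."""
    rule = {}
    candidates = {(a, v) for example, _ in examples for a, v in example.items()}
    while accuracy(rule, examples) < target_accuracy:
        unused = sorted((a, v) for a, v in candidates if a not in rule)
        if not unused:
            break
        best = max(unused, key=lambda av: accuracy({**rule, av[0]: av[1]}, examples))
        if accuracy({**rule, best[0]: best[1]}, examples) <= accuracy(rule, examples):
            break  # no remaining test improves the rule
        rule[best[0]] = best[1]
    return rule


# Hypothetical labelled examples: only left-side wheel images are positive.
examples = [
    ({"side": "left", "part": "wheel"}, True),
    ({"side": "left", "part": "wing"}, False),
    ({"side": "right", "part": "wheel"}, False),
    ({"side": "right", "part": "wing"}, False),
]
rule = learn_one_rule(examples)
```

Only one candidate is kept at each step, which is what distinguishes this search from growing a full subtree over every value of the chosen attribute.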

2.3.1.2.2 Specific-to-General Searches

The  converse  to  the  general-to-specific  search  begins  the  search  process  with  the 
most  specific  rule  and  gradually  generalizes  over  more  positive  cases.  This  search  relies 
on  positive  examples  to  compute  generalizations  of  clauses  [ESP96]. 
2.3.2 Rule Learning Methods


The  first  portion  of  this  research  effort  deals  with  the  problem  of  learning  rule 
sets.  To  familiarize  the  reader  with  the  basic  concepts  of  rule  learning  as  it  applies  to 
IQA, the following example is offered with respect to NAIC’s IEC System.


2.3.2.1 IEC Rule Learning Example

Suppose  a  researcher  seeks  an  image  of  the  left  side  view  of  a  MIG-21  wheel 
well.  The  researcher  uses  the  following  key  words  to  search  for  the  desired  image: 

[left  side  view  MIG  21  Fishbed  wheel  well] 

For  the  purposes  of  this  example,  assume  queries  are  not  case  sensitive.  Records  returned 
include  all  terms  and  any  combination  of  terms  (terms  or-ed  together).  The  researcher 
must  manually  search  for  appropriate  images.  Suppose  the  following  metadata  records 
are  returned: 


1. [CLOSE RIGHT FRONT UNDERSIDE GROUND PARTIAL VIEW OF A
MIG-21 FISHBED BORT 3046 WITH CZECH MARKINGS DETAILING
THE STARBOARD WHEEL WELL REAR SECTION]

2. [CLOSE FRONT UNDERSIDE GROUND PARTIAL VIEW OF A MIG-21
FISHBED BORT 3046 WITH CZECH MARKINGS DETAILING THE
RIGHT (STARBOARD) WHEEL WELL REAR SECTION]

3. [CLOSE INTERIOR PARTIAL VIEW OF A MIG-21 FISHBED BORT
3046 WITH CZECH MARKINGS DETAILING THE LEFT (PORT) WHEEL
WELL FORWARD SECTION]

4. [CLOSE LEFT FRONT UNDERSIDE GROUND PARTIAL VIEW OF A
MIG-21 FISHBED BORT 3046 WITH CZECH MARKINGS DETAILING
THE PORT WHEEL WELL REAR SECTION]

5. [CLOSE INTERIOR PARTIAL VIEW OF A MIG-21 FISHBED BORT
3046 WITH CZECH MARKINGS DETAILING THE LEFT (PORT) WHEEL
WELL REAR SECTION]


Note that the search terms included “MIG 21” with no dash, and all the records
contain “MIG-21.” This difference exposes a severe limitation of many IR systems.
Without an electronic thesaurus that includes the implication {“MIG 21” ⇒ “MIG-21”} or
some other method of defining them as equal, [MIG 21] as a single search term would
not return a record with [MIG-21] in the metadata.
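
One simple way to define such terms as equal is to normalize queries against an equivalence table before searching. The sketch below is only an illustration of that idea; the table contents and function name are hypothetical and do not describe any existing system.

```python
import re

# Hypothetical equivalence table: known variant -> canonical metadata term.
EQUIVALENTS = {"mig 21": "mig-21"}


def normalize_query(query):
    """Lower-case the query, collapse whitespace, and rewrite known variants."""
    text = re.sub(r"\s+", " ", query.strip().lower())
    for variant, canonical in EQUIVALENTS.items():
        text = re.sub(r"\b" + re.escape(variant) + r"\b", canonical, text)
    return text


normalized = normalize_query("left side view MIG 21 Fishbed wheel well")
```

After normalization the query term matches the form stored in the metadata, so the mismatch caused by the missing dash disappears.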

The  researcher  examines  each  of  these  five  records  and  categorizes  each  as 
“positive”,  “negative”  or  “non-applicable”.  All  records  returned  have  a  default  state  of 
“non-applicable” until the researcher changes it to positive or negative. This default ensures that records
returned  but  not  categorized  do  not  affect  the  rule  learning  algorithm. 

Suppose the researcher categorizes record 3 as “positive” and records 1 and 2 as
“negative” matches, and does nothing to records 4 and 5. The learning algorithm looks at
the search input and the metadata of the records categorized as good and bad and
forms rule sets. The rules represent the knowledge that when “MIG 21” is entered, image
3 is preferred over the others.

After  each  query  and  categorization,  the  system  develops  one  or  more  rules  that 
define  both  good  and  bad  responses  for  a  set  of  search  terms.  If  the  researcher  queries  the 
system  again  with  the  exact  same  terms  and  the  database  information  is  unchanged,  the 
results  will  include  the  records  3,  4,  and  5.  The  records  are  also  ordered  with  record  3 
first,  followed  by  records  4  and  5. 

These rules are stored as disjunctions (or) of conjunctions (and), in the form:

$$t_1 \wedge t_2 \wedge t_3 \wedge \dots \wedge t_n$$

where $t_i$ is one of the rule terms and $\wedge$ is the conjunction symbol. Rule sets on different
lines form the disjunctions (or) of conjunctions. Each rule also has a predicate associated
with it representing terms included in the metadata.




Records  4  and  5  follow  record  3  as  they  contain  terms  that  are  still  within  the 
search  criteria,  but  have  not  been  categorized.  Depending  on  the  rule  learning  method, 
records  1  and  2  follow  records  4  and  5  being  least  relevant  based  upon  existing  rules,  or 
the  returned  set  of  relevant  records  excludes  them. 
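
The ordering described in this example can be sketched with rules represented as sets of terms. A rule matches a record when all of its terms appear in the record's metadata. The abbreviated term sets and the two rules below are hypothetical stand-ins loosely modelled on the five records; they are not rules produced by the learning algorithm.

```python
def rule_matches(rule_terms, record_terms):
    """A rule is a conjunction: every rule term must appear in the record."""
    return rule_terms <= record_terms


def order_records(records, positive_rules, negative_rules):
    """Order record ids: positive matches, then unmatched, then negative matches."""
    def rank(item):
        _, terms = item
        if any(rule_matches(rule, terms) for rule in positive_rules):
            return 0
        if any(rule_matches(rule, terms) for rule in negative_rules):
            return 2
        return 1

    ordered = sorted(records.items(), key=lambda item: (rank(item), item[0]))
    return [record_id for record_id, _ in ordered]


# Abbreviated, hypothetical term sets for the five example records.
records = {
    1: {"right", "front", "starboard", "rear"},
    2: {"front", "right", "starboard", "rear"},
    3: {"interior", "left", "port", "forward"},
    4: {"left", "front", "port", "rear"},
    5: {"interior", "left", "port", "rear"},
}
positive_rules = [{"interior", "left", "forward"}]  # one conjunction per rule
negative_rules = [{"starboard"}]
ordered = order_records(records, positive_rules, negative_rules)
```

A system that excludes negatively classified records instead of ranking them last would simply drop the records with rank 2.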

2.3.2.2 FOIL

Quinlan  [QUI93]  describes  FOIL  as  “...a  learning  system  that  constructs  Horn 
clause  programs  from  examples.”  It  uses  a  separate-and-conquer  approach  rather  than  a 
divide-and-conquer  approach  [PAZ92].  FOIL  is  a  non-incremental  learner  that  uses  a 
hill-climbing  technique  guided  by  a  metric  based  on  information  theory  [PAZ92].  FOIL 
inductively  generates  Horn  clauses  similar  to  the  way  ID3  generates  decision  trees  using 
attribute-value  tests  [QUI86].  The  difference  is  FOIL  measures  information  gain  and  uses 
it  to  classify  examples  that  have  higher  gain. 

FOIL has two basic operations: starting a new empty clause, and adding a term to
the end of that clause. The second operation repeats until no negative example is covered.
The process repeats until the set of clauses covers all positive examples. FOIL finds
definitions from relations iteratively using this method.

FOIL  includes  efficient  methods  adapted  from  attribute-value  learning  systems 
and  develops  inexact  but  useful  rules.  It  also  can  find  recursive  definitions,  but  does  not 
possess  the  capability  to  express  functions  within  Horn  clauses.  FOIL  requires  training 
sets  that  include  both  positive  and  negative  examples,  and  cannot  form  new  predicates. 
Finally,  FOIL  is  based  on  a  short-sighted,  greedy  algorithm  which  can  be 
computationally  very  expensive. 

2.3.2.3  First-Order Combined Learner (FOCL)

FOCL is an extension of FOIL that incorporates a variety of background
information to expand the class of solvable problems. The background information takes
the form of rules and represents domain knowledge. With no background information,
FOCL is equivalent to FOIL [PAZ92]. The addition of background information takes
advantage of domain knowledge, which decreases the explored hypothesis space and
increases the accuracy of the learned rules.

This background information is broken down into three class extensions. The
first class provides a method for FOCL to limit the search space. The second extension
allows FOCL to use predefined rules outside the FOIL rule constructor. The third
extension allows the user to input a partial rule that is possibly incorrect. FOCL initially
approximates the predicate of the rule being learned. This particular extension makes
FOCL somewhat analogous to an inductive learning system [PAZ93].

2.3.2.4  First-Order  Induction  of  Decision  Lists  (FOIDL) 

FOIDL is another extension of FOIL. FOIDL modifies FOIL by representing
background knowledge as a logic program. FOIDL neither uses nor constructs explicit
negative examples but quantifies over-generality by estimating the number of negative
examples covered. FOIDL represents a learned program as a first-order decision list. This
approach provides a useful representation for problems with specific exceptions to
general rules [MOO95].

2.4  Lexical Reference Systems (WordNet)


2.4.1  Background

WordNet was created at Princeton University in 1985, when a group of
psychologists and linguists undertook the development of a lexical database [MIL90].
WordNet  is  an  electronic  dictionary  that  divides  words  into  the  categories  of  nouns, 
verbs,  adjectives,  and  adverbs.  While  WordNet  has  the  same  information  as  dictionaries 
and  thesauri,  it  also  has  many  other  features  beyond  definitions,  synonyms,  and 
antonyms. 

2.4.2  Terms and Definitions

The WordNet organization structure consists of semantic relations, which are
relationships between meanings [MIL90]. These relations fall into 5 categories: synonyms,
antonyms, hyponyms/hypernyms, meronyms and morphological relations. The meanings
of the first two terms are well understood, but the other three require definitions.
Examples are provided to clarify their use within WordNet.

2.4.2.1  Hyponyms/Hypernyms

A word x is a hyponym of a word y if the relationship is expressible as “An x is a
(kind of) y” [MIL90]. For example, {beagle} is a hyponym of {dog}, and {dog} is a
hyponym of {mammal}, while {mammal} is a hypernym of {dog}. Therefore, instead of
a lexical relation between word forms as with synonyms and antonyms, hyponymy and
hypernymy reference relationships between word meanings.

This  relationship  is  transitive.  If  the  relation  holds  between  a  first  element  and  a 
second and between the second element and a third, the relation also holds between the
first and third elements. For example, {beagle} is a hyponym of {mammal}. The
relationship  is  also  asymmetric,  so  {dog}  is  not  a  hyponym  of  {beagle}.  These 
relationships  allow  the  expression  of  hyponyms  and  hypernyms  in  a  hierarchical  semantic 
structure  placing  a  hyponym  below  its  superordinate.  A  hyponym  inherits  all  the  features 
of  its  superordinate  and  adds  at  least  one  feature  that  differentiates  it  from  its 
superordinate,  as  well  as  from  other  hyponyms  of  its  superordinate  [MIL90].  The 
conventions  of  hyponyms  and  hypernyms  provide  the  fundamental  organizing  principle 
for  nouns  in  WordNet  [MIL90]. 

2.4.2.2  Meronyms/Holonyms

A word x is a meronym of a word y if the relationship is expressible as “a y has an x”
(equivalently, “an x is a part of a y”); e.g., {thumb} is a meronym of {hand}. A holonym is
the inverse of this relationship; {hand} is a holonym of {thumb} [MIL93]. Meronym
relations are transitive with some qualifications and asymmetrical [MIL90]. WordNet
constructs hierarchies using meronyms, yet this is complex in many instances because a
single meronym can have many holonyms.

2.4.2.3  Morphological Relations

The morphology of a word form is an important consideration in the practical
application of WordNet. The differences between singular and plural nouns and the
tenses of verbs, although conceptually simple, are difficult for computers. For example, if
a person looks up the word flowers, WordNet should not respond by saying flowers is not
in the database whenever flower is present. The current implementation of WordNet
handles the morphological complexities of plural nouns and the tenses of verbs [WOR03].


2.4.2.4  Semantic Components of Nouns

WordNet partitions nouns under a set of semantic primes. Table 2-1 shows this set
of primes [MIL93]. Each prime is the beginning, or prime semantic component, of all
the words structured below it. While these sets vary greatly in size, they are not mutually
exclusive, meaning some words are included under more than one prime. Words included
under more than one prime have more than one sense. A lookup of WordNet online for
the word pen shows the following [WOR03]:

“The  noun  "pen"  has  5  senses  in  WordNet. 

1. pen — (a writing implement with a point from which ink flows)

2.  pen  —  (an  enclosure  for  confining  livestock) 

3.  playpen,  pen  —  (a  portable  enclosure  in  which  babies  may  be  left  to  play) 

4. penitentiary, pen — (a correctional institution for those convicted of major crimes)

5.  pen  —  (female  swan)” 


WordNet separates words contained within each of these groups into individual
files. These files are relatively shallow in a hierarchical sense. Lexical inheritance
systems rarely go more than ten levels deep, and most that venture that deep are technical
in nature. The prime list builds the foundation for the noun arrangement in WordNet. All
nouns fit into one or more categories (more than one when a word has a dramatically
different sense). This is important when considering semantically related terms, which
Section 2.4.4 explores.




Table 2-1: Unique beginners for WordNet Nouns


List of 25 unique beginners for WordNet Nouns

{act, action, activity}       {natural object}
{animal, fauna}               {natural phenomenon}
{artifact}                    {person, human being}
{attribute, property}         {plant, flora}
{body, corpus}                {possession}
{cognition, knowledge}        {process}
{communication}               {quantity, amount}
{event, happening}            {relation}
{feeling, emotion}            {shape}
{food}                        {state, condition}
{group, collection}           {substance}
{location, place}             {time}
{motive}

2.4.3  Adjectives and Semantic Roles

The  primary  function  of  an  adjective  is  the  modification  of  a  noun.  WordNet 
categorizes  adjectives  as  descriptive  or  relational.  Descriptive  adjectives  express  a  value 
of  an  attribute  to  a  noun  [FEL93].  To  say  The  man  is  tall  assumes  there  is  an  attribute 
Height  such  that  Height(man)  =  tall.  Reference-modifying  adjectives  refer  to  the 
temporal  status  of  a  noun,  such  as  the  former  chief  of  staff  or  the  occasional  drink. 
Others  are  intensifying,  such  as  mere  or  virtual. 

Adjectives are treated completely differently from nouns in WordNet. They have
both synonyms and antonyms. Curiously, when two or more adjectives are synonyms, they
rarely (if ever) have the same antonyms. WordNet handles this by using synonym
sets, called synsets. Character tags within the synsets discriminate between synonyms and
antonyms, which allows a computer to find a close match to an adjective by looking for a
synonym of an antonym.

2.4.4  Semantically Related Terms

WordNet  can  identify  semantically  related  terms,  an  important  ability  when 
generalizing  a  term  or  set  of  terms.  Several  identification  methods  exist  for  syntactically 
related  terms.  For  verbs  and  adjectives,  WordNet  uses  synsets  to  determine  like  terms. 
This  is  useful  when  matching  a  query  adjective  with  metadata  adjectives  since  synonyms 
are  relatively  equal.  WordNet  finds  similar  nouns  by  looking  for  the  hyponym  of  the 
superordinate  to  the  query  noun.  It  also  considers  any  noun  pairs  with  the  same 
superordinate  as  similar. 

Figure  2-1  shows  a  hierarchy  for  pen  and  pencil  in  WordNet.  From  the  previous 
information, a search for a “lead pencil” could be strongly generalized to a “pencil” with
a semantic distance of 1, and less strongly to “slate pencil” with a semantic distance of 2.
There  is  a  semantic  distance  of  4  between  lead  pencil  and  ballpoint  (pen).  This  represents 
a  weaker  generalization,  but  still  a  valid  one  since  both  terms  fall  under  the  hierarchical 
level  of  writing  implements. 
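
The distances quoted above can be reproduced with a short sketch. It assumes that semantic distance is the number of edges on the path joining two terms through their closest shared hypernym; the `HYPERNYM` map below is a hypothetical stand-in for the pen and pencil hierarchy, chosen only so that it agrees with the distances stated in the text.

```python
# Hypothetical child -> parent (hypernym) map, chosen to match the distances quoted above.
HYPERNYM = {
    "pen": "writing implement",
    "pencil": "writing implement",
    "ballpoint": "pen",
    "lead pencil": "pencil",
    "slate pencil": "pencil",
}


def path_to_root(term):
    """Return the term followed by each of its hypernyms up to the root."""
    path = [term]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def semantic_distance(a, b):
    """Count the edges joining two terms through their closest shared hypernym."""
    up_a = path_to_root(a)
    up_b = path_to_root(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("terms do not share a hypernym")
```

Under these assumptions, `semantic_distance("lead pencil", "ballpoint")` walks up two levels to the writing implement node and down two levels to ballpoint.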

If  a  noun  (or  adjective)  is  replaceable  with  no  loss  in  meaning,  then  a  tight 
synonymous relationship exists between the two different terms [BRE99]. The semantic
distance between any two terms implies a relationship of some weight. The phrase “big
lead  pencil”  is  replaceable  with  “large  lead  pencil”  with  no  change  in  meaning.  These 
have  a  tight  synonymous  relationship.  The  phrases  “large  pencil”  or  “large  slate  pencil” 
replace  “big  lead  pencil”  with  less  precision,  but  weights  are  computable  by  using 
semantic  distances  of  all  the  different  terms. 

2.5  Summary

This chapter reviews several topics on IR and rule learning essential for an
understanding of IQA. Additionally, WordNet tree expansions, such as the one shown
in Figure 2-1, provide the model for building a semantic tree in IQA. The concept of
semantic trees provides the basis for generalizing terms. Searching this tree for
semantically close terms with existing rule associations makes generalizing possible. The
next chapter defines and details the IQA methodology.

3.  Methodology


3.1  Introduction 

Improving methods for searching through a data structure is a widely researched
problem. User-defined searches usually involve a term or series of terms that is matched
against keywords in the data structure. Any match in one or more keywords during a
search causes that document to be returned. Based upon a metric, the system presents
ordered results.

Generic  database  system  searches  usually  return  a  set  of  records  in  no  predefined 
order.  As  chapter  two  discussed,  several  approaches  improve  this  by  sorting  data  via  a 
metric.  However,  many  of  these  techniques  require  substantial  preprocessing  and 
updating  after  each  data  modification. 

For  web  searches,  this  metric  could  be  the  number  of  page  hits  a  site  receives. 
More recent advances in web searching include counting such things as back links
to  the  target  site.  A  back  link  is  an  instance  of  one  web  page  containing  a  hyperlink  to  the 
target  page.  The  Google  search  engine  sorts  based  on  the  number  of  back  links  to  a 
particular  web  page  combined  with  number  of  page  visits  for  constructing  a  page  rank 
[BRI98].  The  back  link  calculation  requires  crawling  the  Internet  and  World  Wide  Web 
(WWW)  with  multiple  systems.  This  method,  then,  relies  on  substantial  preprocessing  of 
numerous  web  pages  for  good  results. 

This research introduces Intelligent Query Answering (IQA) to develop the
relevance metric. IQA techniques offer many benefits over conventional database
searching. One is that the system learns user-specific rules. Another is the elimination of
information preprocessing. Searches find existing rules, generate pseudo rules, and return
matching documents. Document classification builds new rules and reinforces (either
positively or negatively) rule weights.

The first portion of this chapter clarifies the specific research objectives. The next
section presents the research methodology. The bulk of the chapter discusses the details
of data preparation, document search, FOIL and generalization.

3.2  Research  Objectives 

This research develops new techniques that quickly return the most relevant
records in database searches, as well as good information on semantically related
searches never seen before. This research includes the design, implementation, and
evaluation of a rule learning and rule generalization system that returns the most relevant
records based upon learned and/or generalized rules. Specifically, IQA learns rules using a
modification of Quinlan’s FOIL algorithm and adapts rule weights based upon user
classifications and the Winnow algorithm [BLU97]. On subsequent searches, IQA searches
a semantic tree for semantically similar rules, and builds pseudo rules for the current
search. These pseudo rules provide additional documents and ordering relevance. From
the user classified returned documents, IQA learns new rules and reinforces existing rule
weights.

3.3  Solution  Methodology 

The  solution  is  a  multi-tiered  approach  to  adequately  order  records  through  rule 
learning  and  generalization.  The  solution  occurs  incrementally,  as  data  flows  through  the 
system. The first step prepares the test data and NAIC data for the IQA system. The next
tier  builds  a  semantic  structure  that  facilitates  rule  generalization  and  rule  promotion.  The 
subsequent  phases  discuss  the  overall  implementation  of  IQA,  including  search  and  user 
classification,  rule  learning,  and  finally  rule  generalization. 

3.4  Data Preparation

Preprocessing  is  necessary  to  reduce  the  amount  of  work  done  by  the  IQA  system. 
The  NAIC  data  source  is  an  image  database  with  corresponding  metadata  and  consists  of 
3265  declassified  records.  The  data  used  for  the  IQA  system  comes  from  the  comments 
field  (CMMNT)  in  the  NAIC  data  file. 

Two  other  fields  provide  contextual  clues  for  the  semantic  tree  building  process. 
The  first  field  is  the  subject  field  (SUBJECT).  This  field  identifies  the  type  of  object  the 
image  represents  such  as  an  aircraft  or  a  guided  missile.  The  second  field  is  the  subject 
description  (SUBJECT_DES).  This  field  gives  the  NATO  designator  of  the  object,  and 
often includes additional information about image type, such as MIG-21
AFTERBURNER.

Both of these fields are useful in determining the structure of the semantic tree
by providing clues about which words to look up in WordNet. For example, MIG-21 and
FISHBED would not be in WordNet; however, aircraft and fighter are. These clues
provide the majority of ideas for developing an appropriate hierarchical structure.

Data preprocessing extracts the data from an ODBC compliant database and saves
it to a text file with one field per record. It separates records with line feeds and removes
all extraneous characters (punctuation, double spaces and parentheses). Preprocessing
also removes records with exact duplicate data in the CMMNT field, as well as
prepositions, conjunctions and determiners. These modifications result in 2617 records
and 6652 unique keywords. The resulting words are loaded in an appropriate Java object
structure.
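
The cleaning steps described above might look roughly like the following sketch. It is written in Python for brevity rather than in the Java of the actual system, and several details are assumptions: the stop-word list is illustrative only, punctuation is replaced by a space (so MIG-21 becomes the two keywords MIG and 21), and keywords are upper-cased.

```python
import string

# Illustrative stop-word list only; the actual list of removed words is not given.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "with", "and", "or", "but"}


def preprocess(comments):
    """Clean CMMNT strings, drop exact duplicates, and collect unique keywords."""
    # Assumption: every punctuation character (including parentheses) becomes a space.
    strip_punct = str.maketrans(string.punctuation, " " * len(string.punctuation))
    seen = set()
    records = []
    keywords = set()
    for comment in comments:
        if comment in seen:  # exact duplicate of an earlier CMMNT value
            continue
        seen.add(comment)
        # split() with no argument also collapses the double spaces left behind.
        words = [w for w in comment.translate(strip_punct).upper().split()
                 if w.lower() not in STOP_WORDS]
        records.append(words)
        keywords.update(words)
    return records, keywords
```

The function returns one keyword list per surviving record together with the set of unique keywords, which correspond to the record and keyword counts reported above.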

3.4.1  Semantic Tree Building

WordNet is the inspiration for building a semantic tree; however, WordNet is not
integrated into IQA for two reasons. First, the NAIC data contains many military terms
and NATO weapon system designations not in the WordNet database. To correct this
deficiency, all non-existing terms would have to be placed in WordNet using the
“grinder.” The second reason is the complexity of integrating WordNet into IQA along
with adding search generalization. For these two reasons, a custom semantic tree with
search term generalizer is developed.

WordNet’s semantic structure provides a foundation for rule generalization. The
concept of hypernyms and hyponyms provides a traversable tree structure for finding
semantically similar words and quantifying their distance from each other. Hyponyms
identify the relationship of two words described as “an x is a (kind of) y.” In this
example, x is a hyponym of y, while y is a hypernym of x. Figure 3-1 shows a small
semantic tree. Appendix A lists the NAIC data included in its semantic tree.

WordNet concepts also assist with finding hypernyms and hyponyms for the
NAIC terms as mentioned earlier. When WordNet does not contain a military term or NATO
designator, the logical hypernym is used. The level of detail in the semantic tree
should reflect the granularity needed to effectively generalize rules.

A hash table represents the IQA semantic tree. IQA builds this hash table from a
text file each time IQA runs. This file contains the primary word and pointers to any
hyponym(s) associated with it. Each entry also includes a pointer to its hypernym, unless
the word is the semantic tree root. The final entry is a Boolean flag used to determine
whether the current search iteration has visited this entry (node). This flag is the only data
item that changes in the semantic tree during IQA execution after the initial creation of
the semantic tree hash table. Finally, IQA assumes that all query terms exist in the
semantic tree.
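
A minimal sketch of such a hash table follows, in Python rather than the Java of the actual system. The file layout (`WORD: CHILD CHILD`, one entry per line, single-word terms) and the names `Node`, `build_tree` and `reset_visited` are assumptions made for illustration; pointers are represented as keys into the table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    word: str
    hyponyms: List[str] = field(default_factory=list)  # keys of the child entries
    hypernym: Optional[str] = None                      # key of the parent; None at the root
    visited: bool = False                               # set during a search iteration


def build_tree(lines):
    """Build the word -> Node table from lines of the assumed form 'WORD: CHILD CHILD'."""
    tree: Dict[str, Node] = {}
    for line in lines:
        if not line.strip():
            continue
        word, _, rest = line.partition(":")
        word = word.strip()
        node = tree.setdefault(word, Node(word))
        for child in rest.split():
            node.hyponyms.append(child)
            tree.setdefault(child, Node(child)).hypernym = word
    return tree


def reset_visited(tree):
    """Clear the visited flags, the only data that changes after the table is built."""
    for node in tree.values():
        node.visited = False
```

Keeping the visited flag inside each entry means a new search only needs one pass over the table to reset state, at the cost of making the table unsafe to share between concurrent searches.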

3.4.2  Test Data


A small example set of data facilitates the design, coding and testing of IQA. This
test data consists of a class of shapes with attributes of size and color. There are three
different shapes, as well as three of each type of attribute. Figure 3-1 shows a
visualization of the semantic tree that the test data hash table represents.

3.5  IQA Implementation

IQA includes five major areas: document search, document classification, rule
learning, rule generalization, and rule rewrite. The following sections present a detailed
description of IQA’s implementation.

3.5.1  Document  Search 

When a user performs a search, the IQA system executes four steps. The first step
generates a found rule list, followed by the generation of a pseudo generalized rule list.
IQA then finds all documents matching any of the search terms and rules on both rule lists
and generates a document match list. Finally, IQA assigns a relevance score to each
document and returns the ordered document match list for user classification.

3.5.1.1  Found Rule List

IQA stores new and updated rules to a disk file after each search. Rule search
terms, rule document terms and a rule weight make up each rule object. The user enters
the search terms (st) to find existing rules with matching rule search terms (rst). An exact
match occurs when, for all terms $t$, $(t \in st) \wedge (t \in rst) \wedge (|st| = |rst|)$.
That is, $st = rst$.

Search terms can have one or more matching rules.
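
The exact-match test amounts to set equality between the query terms and a rule's search terms. The sketch below illustrates this in Python; the `Rule` class and `find_rules` function are hypothetical names, not taken from the IQA implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Rule:
    search_terms: FrozenSet[str]    # rst
    document_terms: FrozenSet[str]
    weight: float


def find_rules(search_terms, rules):
    """Return every stored rule whose rule search terms equal the query terms as a set."""
    st = frozenset(search_terms)
    return [rule for rule in rules if rule.search_terms == st]
```

Because the comparison is on sets, the order in which the user types the search terms does not affect which rules are found.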

3.5.1.2  Pseudo Generalized Rule List


The strength of IQA is its capacity for building rules from semantically related
terms. The generation of a pseudo generalized rule list has four steps: finding rules,
removing duplicates, generalizing terms and adjusting rule weights.

3.5.1.2.1  Semantically Similar Rule Search

IQA iterates through each search term and searches the semantic tree to find rules
that exist within the semantic distance (δ) threshold of the current term. When it finds a
rule within the δ threshold, IQA stores the rule to a pseudo generalized rule list. This
continues until IQA exhausts all the search terms.

3.5.1.2.2  Duplicate  Rule  Removal 

IQA then removes (prunes) duplicate rule objects in the list by a direct
comparison. Duplicate rule objects occur because rule objects with multiple terms have
multiple associations with terms in the semantic tree. For example, the rule [RED
CIRCLE] => [BIG RED CIRCLE] has two rule search terms, [RED] and [CIRCLE]. A
search for the terms [BLUE CIRCLE] reveals no rules, so IQA then begins a search of
the semantic tree to find semantically close rules. This search returns a rule associated
with the term [RED] and one associated with the term [CIRCLE]. In this instance, the
rules are the same. IQA returns both rules because the term [BLUE] is a δ of 2 from
[RED], and the term [CIRCLE] is exactly matched (δ=0). The total δ of all found terms is
less than or equal to the δ threshold. There are two identical rules returned by semantic
generalization, so IQA deletes one.

3.5.1.2.3  Term Generalization

The generalization step involves a swap of the search term and rule term.
Consider the search terms [BLUE CIRCLE] and the list including one rule [RED
CIRCLE] => [BIG RED CIRCLE]. The swap function swaps the semantically similar
terms ([BLUE] for [RED]) in the pseudo generalized rule object. The resulting pseudo
generalized rule is [BLUE CIRCLE] => [BIG BLUE CIRCLE].
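
The swap can be illustrated with a few lines of Python. The function name `generalize` and the list representation of a rule are assumptions made for this sketch; it simply replaces the rule term with the semantically similar search term on both sides of the rule.

```python
def generalize(rule_search, rule_document, old_term, new_term):
    """Swap old_term for new_term on both sides of a rule, leaving other terms in place."""

    def swap(terms):
        return [new_term if term == old_term else term for term in terms]

    return swap(rule_search), swap(rule_document)
```

Applied to the rule above with `old_term="RED"` and `new_term="BLUE"`, the sketch yields the search terms BLUE CIRCLE and the document terms BIG BLUE CIRCLE.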


3.5.1.2.4  Generalize Rule Weight Adjustment

The final step computes the pseudo generalized rule weight. The generalize rule
weight (grw) is based upon the following equation.

$$grw = orw \times \left( \frac{1}{1 + \delta} + \frac{1}{2 + rc} \right) \qquad [3-1]$$

The grw equals the original rule weight (orw) multiplied by the sum of the reciprocal of 1
plus δ and the reciprocal of 2 plus the rule count (rc). The rule count equals the total
number of rules returned with the original search terms.

The second half of the equation ensures that a generalized rule weight will never be
greater than the original rule weight. If there are no rules found at the node terms (rc=0),
but there is a rule found with a δ of 1 (at the parent, δ=1), then the second half of the
equation will equal 1. Note that δ will always be at least 1. This means the
generalized rule weight at the parent is not scaled down if the search terms do not return a
rule at the root terms. This is highly desirable since the purpose of the system is to
generalize rules whenever possible.

If there are one or more rules returned for the original search terms, the rule
weight scales down by a factor that directly depends upon the number of those rules, as
well as the δ of the generalized rule found. For a set of search terms that finds no
matching rules (rc=0), but does find one rule to generalize that is a δ of 2 away from the
original term, the generalize rule weight equals:

$$grw = orw \times \left( \frac{1}{1 + 2} + \frac{1}{2 + 0} \right) = orw \times \left( \frac{5}{6} \right) \qquad [3-2]$$
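
As a quick check of the arithmetic, the weight adjustment can be written as a small Python function. The function name and the argument validation are assumptions of this sketch; the formula is the one described in words above, the original rule weight multiplied by the sum of the reciprocal of 1 plus δ and the reciprocal of 2 plus the rule count.

```python
def generalized_rule_weight(orw, delta, rc):
    """Scale the original rule weight by 1/(1 + delta) + 1/(2 + rc)."""
    if delta < 1 or rc < 0:
        raise ValueError("expected delta >= 1 and rc >= 0")
    return orw * (1.0 / (1 + delta) + 1.0 / (2 + rc))
```

With δ of at least 1 and a non-negative rule count, the scaling factor is at most 1/2 + 1/2 = 1, so the generalized weight never exceeds the original weight.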

3.5.1.3  Document  Match 

During  search,  IQA  compares  the  disjunction  of  search  terms  to  each  document 
and  adds  any  document  with  one  or  more  matching  search  terms  to  a  document  match 
list.  It  then  compares  the  rule  search  terms  of  rule  objects  on  both  rule  lists  to  all 
documents  and  adds  any  document  with  an  exact  match  to  the  document  match  list, 
exactly like the find rule match in Section 3.5.1.1.

3.5.1.4  Relevance  Score 

A document’s relevance score consists of the number of original search term hits,
the rule weights of satisfied rules and satisfied pseudo generalized rules, and an additional
value of 2 for each rule satisfied. The document relevance score ($rs_d$) is:

$$rs_d = \frac{n}{sw} \left( \sum_{i=1}^{n} \left( (sw + rw_i) \times rw_i \right) \right) + 2rm \qquad [3-3]$$

In this equation, n is the number of rules satisfied, $rs_d$ is the relevance score, sw is the
search weight, $rw_i$ is the rule weight of the rule satisfied, and rm is the number of rule
matches.




Rule matches (rm) increase the relevance score by 2 for each rule matched. The
other portion of the rs equation depends on a summation of the sw and rw for each rule
matched. The multiplication by rw ensures that rules with rw < 1 force a smaller (often
much smaller) overall rs. The relevance score increases with the number of
matching rules, as well as matching pseudo generalized rules.

IQA prunes the returned document list to reduce the number of documents
returned in large rule sets. IQA removes documents with a relevance score of less than
(0.5)st, where st is the number of original search terms. Searches with large numbers of
terms can return large numbers of documents even without rules, since the term search
uses a disjunction of terms (or-ed). This pruning prevents IQA from returning
superfluous (ultra-low scoring) documents.

3.5.2  User  classification 

IQA  presents  the  user  with  a  matching  document  list  sorted  by  relevance  score 
from  highest  to  lowest.  A  matching  document  consists  of  the  search  terms,  the  document 
terms  and  the  relevance  score.  The  user  classifies  each  returned  document  as  positive 
(good),  negative  (bad),  or  non-applicable  (not  classified.)  Once  the  user  has  classified  the 
documents,  the  rule  learning  process  begins. 

3.5.3  Rule  Learning  Process 

The  rule  learning  portion  of  this  research  uses  the  FOIL  algorithm  [QUI91].  The 
rule  learning  process  consists  of  FOIL,  the  gain  function  and  the  Winnow  Algorithm. 
FOIL learns rule terms based on the gain function. Once rules are learned, the Winnow
algorithm  updates  all  rule  weights. 


3.5.3.1  FOIL 

FOIL  learns  sets  of  first  order  rules  using  a  separate-and-conquer  approach 
[PAZ92].  The  rule  search  terms  are  a  disjunction  of  literals  while  the  rule  document 
terms  are  a  conjunction  of  literals.  Figure  3-2  outlines  the  FOIL  algorithm.  It  starts  with 
an  empty  rule  and  loops  through  positive  and  negative  examples  (separate).  For  each 
literal, FOIL calculates the information gain using an entropy function, as discussed in
Section 3.5.3.2. The literal with the highest gain is added to the antecedent list (conquer).
FOIL  removes  negative  examples  that  do  not  satisfy  the  literal  and  repeats  until  there  are 
no  more  negative  examples.  Once  there  are  no  more  examples  in  the  negative  set,  the 
antecedent  list  is  stored  as  the  next  rule. 


Let A = {}
Let R = {}
Let P be the current set of uncovered positives
Let N be the set of all negative examples
Until P is empty do
    Until N is empty do
        For every feature-value pair (literal) Fi=Vj, calculate Gain(Fi=Vj, P, N)
        Pick the literal, L, with the highest gain.
        Add L to A.
        Remove from N examples that do not satisfy L.
    Return the rule: A = L1 ∧ L2 ∧ ... ∧ Ln => Positive
    Add A to R.
    Let N be the set of all negative examples
    Remove from P examples that satisfy A.
Return the rule set: R = A1 ∨ A2 ∨ ... ∨ An => Positive


Figure  3-2:  FOIL  Algorithm 
FOIL adds all negative examples back to the original set, and all positive
examples  that  match  the  new  rule  are  removed  from  the  positive  set.  If  there  are  still 
positive  examples  remaining,  IQA  creates  a  new  rule  with  no  terms  (separate).  The 
process  continues  until  no  more  positive  examples  exist.  The  result  is  one  or  more  rules 
that  best  cover  the  positive  examples. 

3.5.3.2  The  Gain  Function 

As FOIL compares each of the search terms (referred to as literals in this section),
a gain function chooses which literal to add to a specific rule. FOIL calculates the Foil
Gain for each literal within a (conquer) loop and returns the literal with the maximum
gain. If two or more literals share the maximum gain, FOIL uses the first of them for
consistency. Equation 3-4 gives the Foil Gain function [MIT97].

$$
\mathrm{Foil\_Gain}(L, R) = |p| \left( \log_2 \frac{|p|}{|p| + |n|} - \log_2 \frac{|P|}{|P| + |N|} \right) \tag{3-4}
$$

where L represents the current literal candidate for rule R, P and N are the sets of
positive and negative examples covered before L is added, p is the subset of examples
(documents) in P that satisfy L, and n is the subset of examples in N that satisfy L.
FOIL uses the cardinalities of these four sets to compute the gain for a given literal and
estimates the utility of adding a new literal based on the numbers of positive and negative
examples covered before and after adding the new literal [MIT97].
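As a worked illustration of the gain (the counts are hypothetical and not taken from the thesis data), suppose the rule currently covers |P| = 2 positive and |N| = 6 negative examples, and a candidate literal L is satisfied by both positives and by two of the negatives, so |p| = 2 and |n| = 2:

$$
\mathrm{Foil\_Gain}(L, R) = 2 \left( \log_2 \frac{2}{4} - \log_2 \frac{2}{8} \right) = 2 \bigl( -1 - (-2) \bigr) = 2
$$

A literal that leaves the share of positives unchanged, for example |p| = 1 and |n| = 3, scores 1 × (log2(1/4) − log2(2/8)) = 0, so the gain rewards literals that raise the proportion of positives among the covered examples.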

3.5.3.3  The  Winnow  Algorithm 

FOIL adjusts rule weights for all rules using the Winnow algorithm [BLU97]. The
rule weight is the basis for the rule strength and is an integral part of the relevance score
that determines the order of returned documents. Each time a user classifies a matching
document, FOIL adjusts the associated rule weight. Section 3.5.4.3 discusses the need for
adjusting all rule weights after each search, regardless of classification or use.

When a user classifies a document as positive, the Winnow algorithm increases
the associated rule’s weight by multiplying the current rule weight by the positive rule
adjustment factor. This reinforces the rule’s validity by strengthening the relationship
between the rule search terms and the rule document terms. Conversely, Winnow decreases
the weight of the rule associated with a negatively classified document, weakening the
rule strength. Table 3-1 shows the factor by which Winnow adjusts each rule weight
depending on its classification.


Table 3-1: Winnow Adjustment Factor

Classification      Rule Adjustment Factor
Positive (good)     1.5
Negative (bad)      0.5

One category the Winnow algorithm does not address is adjusting the weights of
rules classified as non-applicable or rules not used. This capability is important for two
reasons. First, it provides a way of differentiating between two similar rules created
during different search iterations. Each rule starts with an initial rule weight of
1.0. Suppose FOIL creates two similar rules 30 search iterations apart. Without any rule
weight adjustment for rules not used, the two rules’ weights would be the same in a
subsequent search that fires both rules. This is undesirable, since a recently learned rule
has a higher relevance to the current search criteria. Second, it provides a way of
identifying a rule that is no longer used.
IQA introduces a specific rule adjustment factor for any rules that users do not classify
or that are not used during a search. Using a rule adjustment factor less than but close to 1.0
for non-classified or unused rules provides a method to subtly reduce the rule weight over
time.


Table 3-2: Modified Winnow Adjustment Factor

Classification      Rule Adjustment Factor
Positive (good)     1.5
Negative (bad)      0.5
Not classified      0.9
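The modified Winnow update can be sketched as a small Python function. The factors come from the table above; the dictionary keys, the function name `update_weight`, and the convention that an empty classification list means "rule not used" are illustrative assumptions rather than IQA's actual interface.

```python
# Adjustment factors from the modified Winnow table; key names are illustrative.
ADJUSTMENT = {"positive": 1.5, "negative": 0.5, "not_classified": 0.9}


def update_weight(weight, classifications):
    """Apply one search iteration's feedback to a rule weight.

    `classifications` holds one label per document the rule returned.
    An empty list is taken to mean the rule was not used, which counts
    as a single "not_classified" adjustment.
    """
    if not classifications:
        return weight * ADJUSTMENT["not_classified"]
    for label in classifications:
        weight *= ADJUSTMENT[label]
    return weight
```

For example, three positive classifications in one iteration raise a weight of 1.0 to 3.375, while an unused rule with weight 1.0 drops to 0.9.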

3.5.4 Generalization

Rule generalization is the method of reducing the number of rules in the rule file.
IQA uses three separate processes to accomplish this: rule promotion, rule assimilation,
and rule aging. Rule promotion occurs when a majority of similar rules exist on the
same semantic level under a single parent node. Rule assimilation is necessary when
FOIL learns a specialized rule and a more general form of the rule already exists at
the parent node on the semantic tree. Rule aging occurs when a rule’s weight drops below
the usefulness threshold.

The semantic tree is used to determine the semantic level, parent node, and δ of the rules
under consideration. Traversing the tree is necessary for determining the need to promote
and/or assimilate rules. IQA considers only the rule weight when aging rules. The most
computationally conservative approach is to iterate through the existing rules and search
the semantic tree term by term. This also gives a starting place for the search, which is
the first search term of the first rule.
3.5.4.1 Rule Promotion

Rule promotion occurs when IQA combines specific rules into a more general
rule. IQA considers rules for promotion when the following criteria are met. First, the rules
must be on the same semantic level. Next, the rules must also have the same parent,
meaning the rules are separated by a δ of 2. This distinction of δ combined with the same
parent is necessary because simply having a δ of 2 could cause an attempted match of a
rule with one at its semantic grandparent’s level. Since one of the criteria for rule
promotion is being on the same semantic level, avoiding this situation conserves
computational effort. The final criterion is that the number of similar rules must represent
a majority of the total number of siblings under the parent. If IQA considers two rules for
promotion and finds that three siblings exist at the semantic level under the parent under
consideration, then a majority exists and IQA promotes the rules. This concept is illustrated
with two similar rules present under a parent with three siblings:

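$$
\begin{aligned}
&\bigl((st_1 \lor st_2 \lor \dots \lor st_n \lor n_y) \Rightarrow (dt_1 \lor dt_2 \lor \dots \lor dt_n \lor n_y)\bigr) \;\land \\
&\bigl((st_1 \lor st_2 \lor \dots \lor st_n \lor n_z) \Rightarrow (dt_1 \lor dt_2 \lor \dots \lor dt_n \lor n_z)\bigr) \\
&\Rightarrow \bigl((st_1 \lor st_2 \lor \dots \lor st_n \lor p_{y,z}) \Rightarrow (dt_1 \lor dt_2 \lor \dots \lor dt_n \lor p_{y,z})\bigr)
\end{aligned}
\tag{3-5}
$$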

where $st_x$ is a rule search term and $dt_x$ is a rule document term. In each of the two rules,
all the terms match with the exception of terms $n_y$ and $n_z$. These terms have the same parent
term, denoted by $p_{y,z}$. Since a majority of the siblings have matching rules per Equation 3-5,
the rules are candidates for promotion. For example, using the shape domain of Figure
X, the two rules [SMALL RED] → [SMALL RED CIRCLE] and [SMALL GREEN]
→ [SMALL GREEN CIRCLE] are generalized to [SMALL COLOR] → [SMALL COLOR
CIRCLE].
IQA builds temporary promote rule objects from each rule that fires. One object is
built for each rule search term and represents a pointer matching this rule to its search
term on the semantic tree. For a rule with two rule search terms, IQA creates two separate
promote rule objects. These rule objects differ only in the current node (the
rule search term) and the node’s parent. The current node provides the semantic location
of a rule, which is crucial when considering rules for promotion. Table 3-3 shows the
promote rule objects for the rule [SMALL CIRCLE] → [SMALL RED CIRCLE].


Table 3-3: Rule Objects

          Rule Object 1                          Rule Object 2
Node      SMALL                                  CIRCLE
Parent    SIZE                                   SHAPE
Rule      [SMALL CIRCLE] → [SMALL RED CIRCLE]    [SMALL CIRCLE] → [SMALL RED CIRCLE]

Once IQA builds all the rule objects, it compares each rule object to every other.
When it finds two rule objects with the same parent, it compares the rule search terms
and rule document terms, less the node terms. When a match is found, IQA creates a
rule-matching object and stores this information for future use. This continues through
each of the promote rule objects.
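The object-building and pairwise comparison steps can be sketched as follows. This is an illustrative sketch only: the `PARENT` map is a hypothetical one-level semantic tree for the shape domain, rules are modeled as (search terms, document terms) tuples, and the function names are not taken from IQA.

```python
from itertools import combinations

# Hypothetical semantic tree for the shape domain: term -> parent term.
PARENT = {
    "CIRCLE": "SHAPE", "SQUARE": "SHAPE", "TRIANGLE": "SHAPE",
    "RED": "COLOR", "BLUE": "COLOR", "GREEN": "COLOR",
    "SMALL": "SIZE", "MEDIUM": "SIZE", "BIG": "SIZE",
}


def promote_objects(rule):
    """Build one (node, parent, rule) object per rule search term."""
    search_terms, _doc_terms = rule
    return [(term, PARENT[term], rule) for term in search_terms]


def _without(terms, node):
    """Terms of a rule with the node term set aside, in a comparable order."""
    return sorted(t for t in terms if t != node)


def find_matches(rules):
    """Return (parent, rule_a, rule_b) for rules that differ only in sibling nodes."""
    objects = [obj for rule in rules for obj in promote_objects(rule)]
    matches = []
    for (na, pa, ra), (nb, pb, rb) in combinations(objects, 2):
        if ra is rb or na == nb or pa != pb:
            continue                       # same rule, same node, or different parent
        if (_without(ra[0], na) == _without(rb[0], nb)
                and _without(ra[1], na) == _without(rb[1], nb)):
            matches.append((pa, ra, rb))   # plays the role of a rule-matching object
    return matches
```

Applied to the rules [SMALL RED] → [SMALL RED CIRCLE] and [SMALL GREEN] → [SMALL GREEN CIRCLE], the sketch reports one match under the parent COLOR, since the two rules agree once RED and GREEN are set aside.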

IQA iterates through the rule-matching objects to determine whether a majority of
matching rules exists for a given parent. If a majority exists, the system marks the first
rule for promotion and all subsequent matching rule objects for deletion. The rule-matching
object accumulates all matching rule weights for calculating a new rule weight for the
promoted rule. If a majority does not exist, IQA promotes none. If a rule-matching object
contains only a single promote rule object, no action is required. Figure 3-3
shows rules before and after a successful rule promotion iteration.


IQA calculates the promoted rule weight as:

$$
prw = \frac{\sum_{i=1}^{n} mrw_i}{m} \times 1.5^{n} \tag{3-6}
$$

where $prw$ is the promoted rule weight, $n$ is the number of matching rules, $mrw_i$ is the
weight of the $i$-th matching rule, and $m$ is the total number of siblings under the parent
of the matching rules.
This equation takes the sum of all matching rule weights and divides it by the total
number of node siblings. It multiplies that value by the positive Winnow classification
factor of 1.5 raised to the power of the number of matching rules. This applies a positive
classification to the promoted rule n times, once for each matching rule, which is the
desired result. Because n is a majority of m and at least two rules match, the promoted
(generalized) rule weight is greater than the average weight of the matching sibling rules
in all promotion instances.
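The majority test and the promoted weight of Equation 3-6 can be sketched in a few lines of Python. The function names are hypothetical, and "majority" is read here as strictly more than half of the siblings, which matches the two-of-three example in the text.

```python
def has_majority(matching_count, sibling_count):
    """True if the matching rules cover more than half of the parent's children."""
    return matching_count > sibling_count / 2


def promoted_rule_weight(matching_weights, sibling_count):
    """Equation 3-6: (sum of matching rule weights / m) * 1.5 ** n."""
    n = len(matching_weights)
    return sum(matching_weights) / sibling_count * 1.5 ** n
```

For two matching rules with weights 1.0 and 1.5 under a parent with three siblings, a majority exists and the promoted weight is (2.5 / 3) × 1.5² = 1.875.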

3.5.4.2 Rule Assimilation

Rule assimilation occurs when IQA deletes a specialized rule because a more
general rule already exists. This process prevents IQA from storing learned specialized
rules that are already generalized, thereby reducing the computational effort of searches.

This process uses the promote rule objects and compares them to each other. For
each promote rule, it checks subsequent ones to see if the promote rule node’s parent
matches the compared promote rule’s node. If so, IQA compares the rule search terms
and rule document terms to see if they match, less the node terms. If they match, the
primary promote rule object is a candidate for assimilation.

During assimilation, IQA marks the candidate for deletion and adjusts the rule weight
of the generalized promote rule object by a factor of 1.5. This adjustment is
necessary since the user classified the document returned by this specialized rule as
positive. Figure 3-4 illustrates rule assimilation.
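A minimal Python sketch of assimilation follows, under the assumption that rules are dictionaries with `search`, `doc`, and `weight` entries and that the semantic tree is the hypothetical `PARENT` map shown. It works directly on rules rather than on promote rule objects, so it illustrates the idea and is not IQA's code.

```python
# Hypothetical semantic tree for the shape domain: term -> parent term.
PARENT = {
    "CIRCLE": "SHAPE", "SQUARE": "SHAPE", "TRIANGLE": "SHAPE",
    "RED": "COLOR", "BLUE": "COLOR", "GREEN": "COLOR",
    "SMALL": "SIZE", "MEDIUM": "SIZE", "BIG": "SIZE",
}


def _without(terms, dropped):
    """Terms with one term set aside, in a comparable order."""
    return sorted(t for t in terms if t != dropped)


def generalizes(general, specific):
    """True if `general` is `specific` with one search term lifted to its parent."""
    for node in specific["search"]:
        parent = PARENT.get(node)
        if parent is None or parent not in general["search"]:
            continue
        if (_without(specific["search"], node) == _without(general["search"], parent)
                and _without(specific["doc"], node) == _without(general["doc"], parent)):
            return True
    return False


def assimilate(rules):
    """Drop specialized rules that a general rule covers; boost the general rule."""
    survivors = [dict(rule) for rule in rules]   # copies, so the input is untouched
    deleted = set()
    for i, specific in enumerate(survivors):
        for j, general in enumerate(survivors):
            if i != j and j not in deleted and generalizes(general, specific):
                general["weight"] *= 1.5         # credit the positive classification
                deleted.add(i)
                break
    return [rule for k, rule in enumerate(survivors) if k not in deleted]
```

Given a general rule [SMALL COLOR] → [SMALL COLOR CIRCLE] with weight 2.0 and a newly learned [SMALL BLUE] → [SMALL BLUE CIRCLE], the sketch keeps only the general rule and raises its weight to 3.0.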
3.5.4.3 Rule Aging

Users periodically misclassify returned documents. The possibility also exists that
a unique rule never fires again. IQA uses rule aging to delete unnecessary rules from the
rules data file. Since many of the IQA functions depend directly on all existing rules, it is
computationally desirable to minimize the number of rules.


Figure 3-4: Rule Assimilation

The Winnow algorithm provides a way of subtly adjusting the rule weight in
cases of rule non-use, or of dramatically reducing the weight of a rule associated with a
negatively classified document. Rule aging flags a promote rule object for deletion if its
rule weight drops below a usefulness threshold. This threshold is set at 0.015. If IQA
creates a rule that never fires, it takes 40 search iterations to reduce the rule’s weight below
this threshold. Each time a user classifies a rule as negative, it reduces the number of
search iterations before aging by 3. A learned rule classified as negative on every subsequent
search iteration takes 7 iterations to drop below the rule aging
threshold. Appendix B lists the table that shows the effects of non- or negative
classifications on rule weights.
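The iteration counts quoted for unused and repeatedly negative rules can be checked with a short Python sketch. It assumes a new rule starts at weight 1.0 and receives exactly one adjustment per search iteration; the function name is illustrative.

```python
USEFULNESS_THRESHOLD = 0.015  # threshold stated in the text


def iterations_until_aged(factor, weight=1.0, threshold=USEFULNESS_THRESHOLD):
    """Count the multiplications by `factor` needed to push `weight` below `threshold`."""
    if not 0 < factor < 1:
        raise ValueError("factor must lie strictly between 0 and 1")
    count = 0
    while weight >= threshold:
        weight *= factor
        count += 1
    return count
```

Under these assumptions, `iterations_until_aged(0.9)` gives 40 (since 0.9^39 is about 0.0164 and 0.9^40 is about 0.0148), and `iterations_until_aged(0.5)` gives 7 (since 0.5^6 = 0.015625 and 0.5^7 = 0.0078125).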

3.5.5 Rule Rewrite

IQA’s  final  step  derives  updated  rules  from  the  promote  rule  objects.  This  portion 
checks  each  promote  rule  object  to  ascertain  its  deletion  status  and  rule  weight.  Promote 
rule  objects  that  have  a  true  deletion  status  flag  or  a  rule  weight  below  the  usefulness 
threshold  are  ignored.  IQA  writes  all  other  promote  rule  objects  to  the  rules  data  file.  This 
rule  object  includes  the  rule  search  terms,  the  rule  document  terms,  and  the  rule  weight. 
IQA  is  now  ready  for  the  next  query. 

3.6 Summary

This chapter discusses the design of IQA in detail. It describes the methodology
and purpose of each concept, and presents implementations of the most prominent
functions in detail. It also illustrates the FOIL algorithm, as well as equations unique
and/or essential to this application. Examples assist with comprehension of IQA’s intent.
The next chapter discusses IQA testing, data gathering, and results analysis.
4. IQA Evaluation and Results


4.1  Introduction 

This chapter describes the techniques used to evaluate IQA. It also presents the
results of the main research goal of developing IQA, a rule learning system that
generalizes rules across semantically similar words. An analysis follows the empirical
results, discussing IQA’s effectiveness.

FOIL learns rules by greedily adding terms based upon a gain function. An
examination of the rule weight follows, ensuring it is initially set properly and updated
through future searches per its classification. The next step demonstrates how existing
rules can affect the relevance of returned documents and shows an example of how rule
weights affect document return order. After IQA builds rules, those rules can be
pseudo-generalized for current search terms using the semantic tree and existing rules.

The rule generalization function includes rule promotion, rule assimilation, and
rule aging. Tests demonstrate each of these capabilities, and log files verify the expected
results. Finally, automated testing compares IQA’s ability to return relevant records in a
correct sequence based on rule weights, and a positively classified document counter.

4.2  Test  Environment 

The IQA system uses the Java programming language, JDK version 1.4.1. The
software is developed and tested on a computer with a 2.0 GHz AMD processor and
512 MB of dual-channel RAM.
4.3 Test Data

IQA uses two sets of data for testing. The first set contains one object type
(shapes) and two descriptors (colors and sizes), and is referred to as the ‘Shape Dataset’
from this point forward. IQA uses the ‘Shape Dataset’ to validate rule learning, the
Winnow algorithm, pseudo generalization, and rule generalization (including rule
promotion, rule assimilation, and rule aging). Table 4-1 shows the terms used for the
shape data test set. Combinations of these three term types, one of each type, make up 27
documents that IQA searches. Additional one- and two-term combinations make up 21
more records, each consisting of at least a shape and a descriptor, bringing the total
number of test records to 48. Appendix C includes the ‘Shape Dataset.’


Table 4-1: Shape Data Terms

SHAPE       COLOR     SIZE
CIRCLE      RED       SMALL
SQUARE      BLUE      MEDIUM
TRIANGLE    GREEN     BIG

Declassified data from the NAIC IEC system makes up the ‘NAIC Dataset,’ which
is a subset of the data held in that system. The NAIC Dataset provides
a more realistic test environment for gathering results on document relevance sequence
tests.


4.4 FOIL

As discussed in Chapter 3, FOIL uses search terms and applicable documents
returned in rule building. To show that IQA adds the correct terms to rules from a search,
two searches provide gain values of the terms, and an evaluation determines whether IQA
selects the appropriate terms for rules. The FOIL test uses shape data only.

The first test uses the search terms [SMALL TRIANGLE] and returns a set of
documents from IQA. There are no learned rules at this point. The tester classifies the
document [SMALL RED TRIANGLE] as positive, and all others as non-applicable. IQA
assigns gain values for associated terms, shown in Table 4-2.


Table 4-2: Gain Test - [SMALL TRIANGLE] Search

TERM                   GAIN
SMALL                  1.000
TRIANGLE               0.585
RED                    2.000
Maxterm = RED          2.000

SMALL                  1.000
TRIANGLE               0.585
Maxterm = SMALL        1.000

TRIANGLE               1.585
Maxterm = TRIANGLE     1.585

IQA identifies three maximum terms. These terms should make the rule
[SMALL TRIANGLE] → [RED SMALL TRIANGLE], since those
were the terms that IQA identified as having the maximum gain. Figure 4-1 shows the
corresponding segment of the log file.
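The gain values of the first test can be reproduced with a short Python sketch, under assumptions that the text does not state explicitly: the 48-record ‘Shape Dataset’ is taken to be the 27 three-term records plus 9 color-shape, 9 size-shape, and 3 shape-only records (Appendix C is not reproduced here, so this composition is inferred from the logs); a search is assumed to return every record containing at least one search term; and the gain baseline is assumed to be the set of records still covered after each selected term. All function names are illustrative.

```python
import math
from itertools import product

SHAPES = ["CIRCLE", "SQUARE", "TRIANGLE"]
COLORS = ["RED", "BLUE", "GREEN"]
SIZES = ["SMALL", "MEDIUM", "BIG"]


def shape_dataset():
    """Assumed reconstruction of the 48 test records."""
    docs = [frozenset(c) for c in product(SIZES, COLORS, SHAPES)]   # 27 records
    docs += [frozenset(c) for c in product(COLORS, SHAPES)]         # 9 records
    docs += [frozenset(c) for c in product(SIZES, SHAPES)]          # 9 records
    docs += [frozenset([s]) for s in SHAPES]                        # 3 records
    return docs


def foil_gain(literal, covered, positives):
    """Foil Gain with P and N taken from the currently covered records."""
    pos_before = sum(1 for d in covered if d in positives)
    with_literal = [d for d in covered if literal in d]
    pos_after = sum(1 for d in with_literal if d in positives)
    if pos_after == 0:
        return 0.0
    return pos_after * (math.log2(pos_after / len(with_literal))
                        - math.log2(pos_before / len(covered)))


def gain_trace(search_terms, positive_doc):
    """Return the number of returned records and, per step, (gains, chosen term)."""
    returned = [d for d in shape_dataset() if d & set(search_terms)]
    positives = {frozenset(positive_doc)}
    covered, remaining, trace = returned, sorted(positive_doc), []
    while remaining:
        gains = {t: round(foil_gain(t, covered, positives), 3) for t in remaining}
        best = max(remaining, key=lambda t: gains[t])
        trace.append((gains, best))
        covered = [d for d in covered if best in d]   # narrow to records matching the rule so far
        remaining.remove(best)
    return len(returned), trace
```

Calling `gain_trace(["SMALL", "TRIANGLE"], ["SMALL", "RED", "TRIANGLE"])` under these assumptions returns 24 records and selects RED (2.000), then SMALL (1.000), then TRIANGLE (1.585), which agrees with the gain table for this search.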

Figure 4-1 shows that the initial search creates one rule, and that rule matches the
results shown in Table 4-2. This result is expected since we classified only one document
as positive, and that document makes up the rule terms. The possibility exists that two or
more terms can have the same highest gain value. When this occurs, FOIL stores the first
term with that value as the next rule term. Duplicate gain values become less probable
when dealing with datasets that are less symmetric and less evenly distributed, such as
the NAIC data. Note that the order of document terms in a rule does not affect the
performance of IQA.


NEW  RULES 


Total  number  of  new  Rules  =  1 


[[SMALL,  TRIANGLE],  [RED,  SMALL,  TRIANGLE],  1.0] 
****************************************************** 


Figure  4-1:  IQA  Log  File  Segment-  [SMALL  TRIANGLE]  Search 


Next, the tester searches for [RED TRIANGLE]; this test uses the current existing
data and rule. This time the tester classifies all documents with the terms [RED
TRIANGLE] as positive. Table 4-3 shows the gain values for each term. This time IQA
identifies two maximum terms, [RED TRIANGLE]. Figure 4-2 shows part of the IQA log
file after this test.


Total  number  of  classified  results  =  24 


[[RED,  TRIANGLE],  [RED,  TRIANGLE],  1] 

[[RED,  TRIANGLE],  [BIG,  RED,  TRIANGLE],  1] 
[[RED,  TRIANGLE],  [MEDIUM,  RED,  TRIANGLE],  1] 
[[RED,  TRIANGLE],  [SMALL,  RED,  TRIANGLE],  1] 
[snip] 


Total  number  of  Rules  =  2 


[[SMALL,  TRIANGLE],  [RED,  SMALL,  TRIANGLE],  0.9] 
[[RED,  TRIANGLE],  [RED,  TRIANGLE],  3.375] 


[snip] 


Figure  4-2:  IQA  Log  File  Segment  -  [RED  TRIANGLE]  Search 


4.4.1 Rule Weight Updates

IQA updates all rule weights after each search iteration. IQA assigns a weight of
1.0 to each new rule. Figure 4-1 shows the new rule set after the first search with this
value. If a rule returns a document and the user classifies it as positive, then IQA
multiplies the current rule weight by 1.5. If a rule returns a document and the user
classifies it as negative, then IQA multiplies the current rule weight by 0.5. IQA
multiplies the current rule weight by 0.9 under the following conditions:

•  the user classifies the document as non-applicable,

•  the user does not classify the document, or

•  the rule is not used during the current search iteration.


Table 4-3: Gain Test - [RED TRIANGLE] Search

TERM                   GAIN
RED                    2.211
TRIANGLE               2.211
SMALL                  0.000
RED                    2.211
TRIANGLE               2.211
MEDIUM                 0.000
RED                    2.211
TRIANGLE               2.211
BIG                    0.000
Maxterm = RED          2.211

TRIANGLE               4.755
SMALL                  0.000
RED                    0.000
TRIANGLE               4.755
MEDIUM                 0.000
RED                    0.000
TRIANGLE               4.755
BIG                    0.000
Maxterm = TRIANGLE     4.755

If the rule returns multiple documents, then there are multiple classifications and
IQA updates its rule weight accordingly. Figure 4-2 shows the rule output from the
second search. The rule [SMALL TRIANGLE] → [RED SMALL TRIANGLE] does not
fire, and therefore is updated by a factor of 0.9. The rule [RED TRIANGLE] → [RED
TRIANGLE] shows a final rule weight of 3.375, which may seem counterintuitive given
the initial value of 1.0. This is correct since the user classifies the four documents
matching that rule as positive during the same iteration. The initial rule weight is set at
1.0 per the Winnow algorithm. The subsequent classifications increase it by a factor of
1.5 three times, raising it to 1.5, 2.25, and finally 3.375.

4.5 Rule Relevance

IQA uses a relevance weight to sort documents returned for user classification. As
discussed in Chapter 3, this relevance weight consists of the number of search terms
found, the number of rules that match the document, and the weights of those rules.
IQA returns documents in relevance weight (also known as relevance score) order, from
highest to lowest. This test demonstrates how a rule’s weight affects the way IQA sorts
returned documents.

The tester now searches for the terms [RED TRIANGLE] in the third search using
the rule data from the previous tests. This time the tester classifies the document [BIG
RED TRIANGLE] as positive, and all others as non-applicable. The fourth search uses
the same search terms. This time the tester classifies only [MEDIUM RED TRIANGLE]
as positive and all other documents as non-applicable. Figure 4-3 shows the results for
the third and fourth searches. Note that the rule [RED TRIANGLE] → [RED
TRIANGLE] has a rule weight that continues to increase. Its search terms match those of
the other rules, while its implied document terms are a subset of the other rules’ implied
document terms. This indicates rule validation by positive classification, and the rule
weight increases accordingly.
THIRD SEARCH
[snip]

Total number of classified results = 24

[[RED, TRIANGLE], [RED, TRIANGLE], 0]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 1]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 0]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 0]
[snip]

Total number of new Rules = 3

[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 0.81]
[[RED, TRIANGLE], [RED, TRIANGLE], 3.690562]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 1.0]
******************************************************

[snip]

FOURTH SEARCH
[snip]

Total number of classified results = 24

[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 0]
[[RED, TRIANGLE], [RED, TRIANGLE], 0]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 1]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 0]
[snip]

Total number of new Rules = 4

[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 0.729]
[[RED, TRIANGLE], [RED, TRIANGLE], 4.0356293]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 0.9]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 1.0]
******************************************************

[snip]

Figure  4-3:  IQA  Log  File  Segment  -  [RED  TRIANGLE]  Search 
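The logged weights are consistent with the adjustment factors of 1.5 and 0.9 if one assumes (an inference from the log rather than a statement in the text) that the rule [RED TRIANGLE] → [RED TRIANGLE] matches the same four documents in each search, one classified as positive and three as non-applicable:

$$
3.375 \times 1.5 \times 0.9^{3} = 3.6905625, \qquad 3.6905625 \times 1.5 \times 0.9^{3} \approx 4.03563
$$

Under the same reading, the unused rule [SMALL TRIANGLE] → [RED SMALL TRIANGLE] decays by 0.9 per search (0.9, 0.81, 0.729), and the rule [RED TRIANGLE] → [BIG RED TRIANGLE] drops from 1.0 to 0.9 in the fourth search because its only matching document is classified as non-applicable.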


Each of these four test searches provides a list of unclassified documents sorted in
relevance order. Figure 4-4 shows the first two searches, while Figure 4-5 shows the third.
Results in italics indicate sort orders of interest, while results in bold indicate which
results are subsequently classified as positive. Bold also indicates rules built after
classification.

The first log shows the search for [SMALL TRIANGLE]. The top three
documents all have a relevance weight of 2.0, so the sort order depends on the order in
which the raw documents were loaded. Since no rules exist, term matching determines the
relevance weight. The second and successive searches in this section are for [RED TRIANGLE].
The documents in the third search have rule weights associated with them. Since each of
the top three records matches the rule, the rule weights are equal and the sort order does
not change. IQA builds a new rule in the third search, as shown in Figure 4-5.


FIRST SEARCH

Total number of unclassified results = 24

[[SMALL, TRIANGLE], [SMALL, BLUE, TRIANGLE], 2.0]
[[SMALL, TRIANGLE], [SMALL, TRIANGLE], 2.0]
[[SMALL, TRIANGLE], [SMALL, RED, TRIANGLE], 2.0]
[[SMALL, TRIANGLE], [SMALL, GREEN, TRIANGLE], 2.0]
[[SMALL, TRIANGLE], [RED, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [BIG, BLUE, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, GREEN, CIRCLE], 1.0]
[[SMALL, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, SQUARE], 1.0]
[snip]
[[SMALL, TRIANGLE], [SMALL, CIRCLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, BLUE, SQUARE], 1.0]
[[SMALL, TRIANGLE], [BIG, GREEN, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, RED, SQUARE], 1.0]
[snip]

Total number of Rules = 1

[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 1.0]

SECOND SEARCH

Total number of unclassified results = 24

[[RED, TRIANGLE], [RED, TRIANGLE], 2.0]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 2.0]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 2.0]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 2.0]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BIG, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, SQUARE], 1.0]
[snip]
[[RED, TRIANGLE], [BIG, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [RED, SQUARE], 1.0]
[snip]

Total number of Rules = 2

[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 0.9]
[[RED, TRIANGLE], [RED, TRIANGLE], 3.375]

Figure  4-4:  IQA  Log  File  Segment  -  [SMALL  TRIANGLE]  and  [RED  TRIANGLE]  Search 


Figure  4-6  shows  the  fourth  and  fifth  searches.  The  fourth  search  shows  a  change 
in  sort  order.  The  fourth  search  uses  the  two  [RED  TRIANGLE]  rules  to  calculate  the 
relevance  of  returned  documents.  This  increases  the  relevance  weight  of  document  [BIG 
RED  TRIANGLE],  and  thus  moves  it  to  the  top  of  the  search  order.  In  the  fourth  search, 
a  different  document  is  classified  as  positive,  which  generates  a  fourth  rule.  This  fourth 
rule  affects  the  fifth  search  by  again  reordering  the  top  three  results. 

IQA generates rules on the ‘Shape Dataset’ in an intuitive way. This is due to the
symmetry of the data in both the number of terms in each document and the equal term
distribution. Document length symmetry and term proportionality reduce testing and
debugging complexities. However, testing with non-symmetric and non-proportionate
data is highly desirable since it is much closer to a live search environment. Appendix C
includes the complete log file for these tests.


THIRD SEARCH

Total number of unclassified results = 24

[[RED, TRIANGLE], [RED, TRIANGLE], 20.140625]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 20.140625]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 20.140625]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 20.140625]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BIG, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [TRIANGLE], 1.0]
[snip]
[[RED, TRIANGLE], [SMALL, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [RED, SQUARE], 1.0]

Total number of Rules = 3

[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 0.81]
[[RED, TRIANGLE], [RED, TRIANGLE], 3.690562]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 1.0]

Figure  4-5:  IQA  Log  File  Segment  -  [RED  TRIANGLE]  Search 


4.6 Pseudo Generalization

Pseudo generalization is the process IQA uses to find semantically
similar rules for given search terms. IQA uses those similar terms to build new temporary
rules, and returns documents that match those rules. IQA scales the rule weight of those
pseudo-generalized rules by a factor based on the number of current rules that match the
search terms, and the δ between the original and generalized terms.

Three aspects of pseudo generalization performance are of interest. The first
aspect is the creation of pseudo-generalized rules and the effects on document relevance
order. The next is the rule weight versus pseudo-generalized rule weight, and how they
affect document relevance scores. The final aspect is the effect of negative
document classification on pseudo-generalized rules.
FOURTH SEARCH

Total number of unclassified results = 24

[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 26.001371]
[[RED, TRIANGLE], [RED, TRIANGLE], 23.001371]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 23.001371]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 23.001371]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BIG, GREEN, TRIANGLE], 1.0]
[snip]
[[RED, TRIANGLE], [SMALL, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [RED, SQUARE], 1.0]
[snip]

Total number of Rules = 4

[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 0.729]
[[RED, TRIANGLE], [RED, TRIANGLE], 4.0356293]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 0.9]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 1.0]

FIFTH SEARCH

Total number of unclassified results = 24

[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 29.357563]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 26.531805]
[[RED, TRIANGLE], [RED, TRIANGLE], 26.357563]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 26.357563]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BIG, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [RED, SQUARE], 1.0]

Figure  4-6:  IQA  Log  File  Segment  -  [RED  TRIANGLE]  Searches 


4.6.1  Pseudo-generalize Rule Creation

Rules must exist for pseudo generalization to occur, so the tester searches and classifies a document to create a rule. The tester then searches the data with semantically similar terms to verify that IQA generalizes rules within the defined δ. For these evaluations, the δ threshold is 2 for all searches. This ensures that IQA changes at most one search term during pseudo-generalization.
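The δ check can be illustrated with a minimal sketch. It assumes that δ is the number of edges between two terms in the semantic tree, so that siblings such as [SQUARE] and [CIRCLE] are a δ of 2 apart. The tree contents, the parent name `SIZE`, the function names, and the position-by-position comparison of terms are all assumptions made for illustration, not IQA's implementation.

```python
# Hypothetical semantic tree for the 'Shape Dataset', stored as child -> parent.
# [COLOR] and [SHAPE] are named in the text; [SIZE] is an assumed name.
PARENT = {
    "RED": "COLOR", "GREEN": "COLOR", "BLUE": "COLOR",
    "TRIANGLE": "SHAPE", "SQUARE": "SHAPE", "CIRCLE": "SHAPE",
    "SMALL": "SIZE", "MEDIUM": "SIZE", "BIG": "SIZE",
}


def ancestors(term):
    """Return [term, parent, grandparent, ...] up to the top of the tree."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def delta(a, b):
    """Number of tree edges between two terms, or None if they are unrelated."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None


def within_threshold(search_terms, rule_terms, threshold=2):
    """True if the summed delta between paired terms is at most the threshold."""
    if len(search_terms) != len(rule_terms):
        return False
    total = 0
    for s, r in zip(search_terms, rule_terms):  # simplification: compare by position
        d = delta(s, r)
        if d is None:
            return False
        total += d
    return total <= threshold
```

With a threshold of 2, `within_threshold(["BIG", "SQUARE"], ["BIG", "CIRCLE"])` holds because only one term moves to a sibling, while swapping two terms for siblings gives a combined δ of 4 and is rejected.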

The next search uses the terms [BIG CIRCLE] with no learned rules. The tester classifies the document [BIG GREEN CIRCLE] as positive and the rest of the documents as non-applicable. This generates the rule [BIG CIRCLE] → [GREEN BIG CIRCLE] as shown in Figure 4-7. Note that the document classified as positive is third in the unclassified results list.

Total number of unclassified results = 24

[[BIG, CIRCLE], [SMALL, CIRCLE], 1.0]
[[BIG, CIRCLE], [BIG, GREEN, TRIANGLE], 1.0]
[[BIG, CIRCLE], [BIG, GREEN, SQUARE], 1.0]
[[BIG, CIRCLE], [MEDIUM, CIRCLE], 1.0]
[snip]

Total number of classified results = 24

[[BIG, CIRCLE], [BIG, RED, CIRCLE], 0]
[[BIG, CIRCLE], [BIG, GREEN, CIRCLE], 1]
[[BIG, CIRCLE], [BIG, CIRCLE], 0]
[snip]

Total number of Rules = 1

[[BIG, CIRCLE], [GREEN, BIG, CIRCLE], 1.0]

***************************************************

Figure 4-7: IQA Log File Segment - Pseudo-generalize Rule Creation for [BIG CIRCLE]


The next search uses the terms [BIG SQUARE]. Since [SQUARE] has a δ of 2 from [CIRCLE], IQA should create a pseudo-generalized rule of [BIG SQUARE] → [GREEN BIG SQUARE]. The existence of this temporary rule forces the document [BIG GREEN SQUARE] to have a higher document relevance weight than all others and thus to appear first for classification. Figure 4-8 shows that IQA presents the document [BIG GREEN SQUARE] to the user for classification first and that its relevance weight is indeed greater than that of the other documents with the terms [BIG SQUARE]. This is due to the existence of the semantically similar rule [BIG CIRCLE] → [GREEN BIG CIRCLE]. All returned documents are classified as non-applicable for this test and IQA creates no new rules. IQA also scales the rule weight for [BIG CIRCLE] by a factor of 0.9.

Total number of unclassified results = 24

[[BIG, SQUARE], [BIG, GREEN, SQUARE], 3.777778]
[[BIG, SQUARE], [BIG, BLUE, SQUARE], 2.0]
[[BIG, SQUARE], [BIG, RED, SQUARE], 2.0]
[[BIG, SQUARE], [BIG, SQUARE], 2.0]
[snip]

Total number of Rules = 1

[[BIG, CIRCLE], [GREEN, BIG, CIRCLE], 0.9]

Figure 4-8: IQA Log File Segment - Pseudo-generalize Rule Search for [BIG SQUARE]


Note the relevance weight for the unclassified document [BIG GREEN SQUARE]. Two search terms match, as well as the matching pseudo-generalized rule. IQA computes the pseudo-generalized rule weight grw as 2/3. The relevance score rs is computed as 3.778 as shown below:

$$ grw = 1 \times \left( \frac{1}{1+2} + \frac{1}{2+1} \right) = \frac{2}{3} \tag{4-1} $$

$$ rs = \left( \left( 2 + \tfrac{2}{3} \right) \times \tfrac{2}{3} \right) + (2 \times 1) = 1.777778 + 2 = 3.778 \tag{4-2} $$
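The arithmetic in Equations 4-1 and 4-2 can be checked with a short sketch. The general forms used here are assumptions inferred from the worked example, not formulas stated in this section: the scaling factor is taken to be 1/(n + 2) + 1/(δ + 1), where n is the number of current rules, and the score is taken to be (t + w) × w + t × 1, where t is the number of matching search terms and w is the (possibly scaled) rule weight. The function and parameter names are hypothetical.

```python
def generalized_rule_weight(rule_weight, n_rules, delta):
    """Scale a rule weight for a pseudo-generalized rule (assumed general form)."""
    return rule_weight * (1.0 / (n_rules + 2) + 1.0 / (delta + 1))


def relevance_score(weight, matched_terms):
    """Relevance score for a document matched by one rule (assumed general form)."""
    return (matched_terms + weight) * weight + matched_terms * 1.0


# Figure 4-8: one rule of weight 1.0, [SQUARE] is a delta of 2 from [CIRCLE].
grw = generalized_rule_weight(1.0, n_rules=1, delta=2)
rs = relevance_score(grw, matched_terms=2)
print(round(grw, 6), round(rs, 6))
```

Under these assumptions the same two functions also reproduce the score logged for [SMALL RED TRIANGLE] in Figure 4-17, taking a parent rule of weight 1.425 and a δ of 1, which gives some support for the inferred form without confirming it.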

4.6.2  Rules versus Pseudo Generalized Rules

Rule weights have the strongest influence on a document’s relevance score. If a rule exists for a set of search terms and IQA pseudo-generalizes another semantically similar rule, it is plausible that the native rule could have a smaller rule weight than the pseudo-generalized rule, and therefore less influence on the overall relevance score. The following example shows just that.

The single rule built in Section 4.6.1 and the ‘Shape Dataset’ provide the basis for this test. Since a second search yielded no positive or negative classifications, IQA adjusts the rule’s weight by a factor of 0.9 as shown in Figure 4-8. The search terms are again [BIG CIRCLE] and the tester classifies only the document [BIG GREEN CIRCLE] as positive. The tester repeats this search and classification to reinforce the learned rule. Figure 4-9 shows the current relevance scores and rule weight.

Total number of unclassified results = 24

[[BIG, CIRCLE], [BIG, GREEN, CIRCLE], 6.5224996]
[[BIG, CIRCLE], [BIG, RED, CIRCLE], 2.0]
[[BIG, CIRCLE], [BIG, BLUE, CIRCLE], 2.0]
[snip]

Total number of Rules = 1

[[BIG, CIRCLE], [GREEN, BIG, CIRCLE], 2.0249999]

***************************************************

Figure 4-9: IQA Log File Segment - Pseudo Generalized Rule Weight


The search terms are now [GREEN SQUARE]. The only rule that exists is [BIG CIRCLE] → [GREEN BIG CIRCLE]. Since [GREEN] is a δ of 2 from [BLUE], and [SQUARE] is also a δ of 2 from [CIRCLE], the combined δ from the current search terms to the existing rule is greater than the δ threshold of 2, so IQA should not pseudo-generalize the rule for this test. The tester classifies the document [BIG GREEN SQUARE] as positive and IQA generates the results shown in Figure 4-10.

The tester repeats the search twice for [BIG CIRCLE] and classifies the document [BIG GREEN CIRCLE] as positive. Figure 4-11 shows the results. The rule weight for [BIG CIRCLE] is now more than four times as large as the rule weight for [GREEN SQUARE].

Total number of unclassified results = 24

[[GREEN, SQUARE], [SMALL, GREEN, SQUARE], 2.0]
[[GREEN, SQUARE], [BIG, GREEN, SQUARE], 2.0]
[[GREEN, SQUARE], [GREEN, SQUARE], 2.0]
[[GREEN, SQUARE], [MEDIUM, GREEN, SQUARE], 2.0]
[snip]

Total number of classified results = 24

[[GREEN, SQUARE], [SMALL, GREEN, SQUARE], 0]
[[GREEN, SQUARE], [BIG, GREEN, SQUARE], 1]
[[GREEN, SQUARE], [GREEN, SQUARE], 0]
[snip]

Total number of Rules = 2

[[BIG, CIRCLE], [GREEN, BIG, CIRCLE], 1.8224999]
[[GREEN, SQUARE], [BIG, GREEN, SQUARE], 1.0]

Figure 4-10: IQA Log File Segment - Pseudo Generalized Rule Weight Comparison


Total number of unclassified results = 24

[[BIG, CIRCLE], [BIG, GREEN, CIRCLE], 14.940888]
[[BIG, CIRCLE], [BIG, RED, CIRCLE], 2.0]
[[BIG, CIRCLE], [BIG, CIRCLE], 2.0]
[snip]

>>>>> RULES AFTER PROMOTION <<<<<

Total number of Rules = 2

[[BIG, CIRCLE], [GREEN, BIG, CIRCLE], 4.100625]
[[GREEN, SQUARE], [BIG, GREEN, SQUARE], 0.81]

***************************************************

Figure 4-11: IQA Log File Segment - Rule Weight Reinforcement


The tester now changes the search terms to [GREEN SQUARE] and compares the unclassified results against a search for [BIG SQUARE] without classifying any documents on either search. The search [BIG SQUARE] has no rules associated with it, but it does generalize with the rule for [BIG CIRCLE]. Figure 4-12 shows the difference in relevance weights returned by each respective search. Both searches match one rule and have two matching terms, but IQA generalizes [BIG SQUARE] → [BIG GREEN SQUARE] from [BIG CIRCLE] → [BIG GREEN CIRCLE]. Since [BIG CIRCLE] has a much greater rule weight than [GREEN SQUARE], its relevance score is also higher, even when scaled down by the pseudo generalization algorithm.

Total number of unclassified results = 24

[[GREEN, SQUARE], [BIG, GREEN, SQUARE], 4.2761]
[[GREEN, SQUARE], [SMALL, GREEN, SQUARE], 2.0]
[[GREEN, SQUARE], [GREEN, SQUARE], 2.0]
[[GREEN, SQUARE], [MEDIUM, GREEN, SQUARE], 2.0]
[snip]

Total number of unclassified results = 15

[[BIG, SQUARE], [BIG, GREEN, SQUARE], 12.505876]
[[BIG, SQUARE], [BIG, RED, SQUARE], 2.0]
[[BIG, SQUARE], [BIG, BLUE, SQUARE], 2.0]
[snip]

Figure 4-12: IQA Log File Segment - [GREEN SQUARE] vs [BIG SQUARE] search comparison


4.6.3  Pseudo-Generalized Rules Classified as Negative

IQA learns rules through the standard FOIL algorithm as described in Section 4.4. However, IQA derives pseudo-generalized rules from the process described in Section 4.6.1. If the user classifies all documents returned by pseudo-generalized rules as negative or non-applicable, IQA ignores those pseudo-generalized rules.

4.7  Rule Generalization

A pseudo-generalized rule becomes a generalized rule when one or more of the documents IQA returns are classified as positive. This happens concurrently while IQA is using FOIL to learn other rules based upon document classification. This section demonstrates the rule promotion and rule absorption functions.

4.7.1  Rule Promotion

Rule promotion refers to taking two or more rules and generalizing one of the terms to a parent term. Figure 4-13 shows an example of this. Rule promotion is desirable because it keeps the number of rules to a minimum.

BEFORE RULE PROMOTION

[GREEN TRIANGLE] → [SMALL GREEN TRIANGLE]
[BLUE TRIANGLE] → [SMALL BLUE TRIANGLE]

AFTER RULE PROMOTION

[COLOR TRIANGLE] → [SMALL COLOR TRIANGLE]

Figure 4-13: Rule Promotion

This test starts with no rules learned and a search for [GREEN TRIANGLE]. The tester classifies the document [SMALL GREEN TRIANGLE] as positive. The next search is for [BLUE TRIANGLE] and the tester classifies the document [SMALL BLUE TRIANGLE] as positive. Figure 4-14 shows the log file results for this test. Note that in the second search, IQA generalizes the [GREEN TRIANGLE] rule for the [BLUE TRIANGLE] search and therefore returns a higher relevance score. Once the FOIL algorithm is complete, the rule promotion function goes through all rules, comparing them with the semantic tree and searching for candidates for promotion. In this instance the terms [BLUE] and [GREEN] are semantic siblings under the parent term [COLOR]. Since two of the three siblings have a similar rule, IQA promotes (generalizes) them.
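The promotion pass can be sketched as follows. This is a simplified illustration rather than IQA's code: the rule representation, the `promote` function, and the `min_siblings` parameter are hypothetical. The weight combination (1.5 times the mean of the merged weights) is a guess that happens to reproduce the value 1.425 logged in Figure 4-14; the text does not state how IQA combines weights.

```python
from collections import defaultdict

# Partial, hypothetical semantic tree stored as child -> parent.
PARENT = {"RED": "COLOR", "GREEN": "COLOR", "BLUE": "COLOR"}


def generalize(terms, child, parent):
    """Replace one child term with its parent term."""
    return tuple(parent if t == child else t for t in terms)


def promote(rules, min_siblings=2):
    """Merge rules that differ only in sibling terms into one parent-level rule.

    rules: list of (search_terms, document_terms, weight) with terms as tuples.
    """
    candidates = defaultdict(list)
    for rule in rules:
        search, doc, _ = rule
        for term in search:
            if term in PARENT:
                key = (generalize(search, term, PARENT[term]),
                       generalize(doc, term, PARENT[term]))
                candidates[key].append(rule)

    result = list(rules)
    for (search, doc), members in candidates.items():
        if len(members) >= min_siblings and all(m in result for m in members):
            for m in members:
                result.remove(m)
            # Assumed weight combination, chosen to match Figure 4-14.
            weight = 1.5 * sum(m[2] for m in members) / len(members)
            result.append((search, doc, weight))
    return result


rules = [
    (("GREEN", "TRIANGLE"), ("SMALL", "GREEN", "TRIANGLE"), 0.9),
    (("BLUE", "TRIANGLE"), ("SMALL", "BLUE", "TRIANGLE"), 1.0),
]
print(promote(rules))
```

Running the check once over the whole rule set after classification, as this sketch does, mirrors the single post-classification pass described in the text.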

IQA can also promote a set of generalized rules. Consider a search for [RED SQUARE] that positively classifies [SMALL RED SQUARE], followed by a search for [GREEN SQUARE] that positively classifies [SMALL GREEN SQUARE]. In this instance, IQA creates the rule set shown in Figure 4-15. Note that IQA promotes these two rules to the more general rule of [COLOR SQUARE] → [SMALL COLOR SQUARE].


Total  number  of  unclassified  results  =  24 


[[GREEN,  TRIANGLE],  [BIG,  GREEN,  TRIANGLE],  2.0] 
[[GREEN,  TRIANGLE],  [GREEN,  TRIANGLE],  2.0] 

[[GREEN,  TRIANGLE],  [MEDIUM,  GREEN,  TRIANGLE],  2.0] 
[[GREEN,  TRIANGLE],  [SMALL,  GREEN,  TRIANGLE],  2.0] 
[snip] 


Total  number  of  new  Rules  =  1 


[[GREEN,  TRIANGLE],  [SMALL,  GREEN,  TRIANGLE],  1.0] 

****************************************************** 

[snip to next search]


Total  number  of  unclassified  results  =  24 


[[BLUE,  TRIANGLE],  [SMALL,  BLUE,  TRIANGLE],  3.777778] 
[[BLUE,  TRIANGLE],  [BIG,  BLUE,  TRIANGLE],  2.0] 

[[BLUE,  TRIANGLE],  [BLUE,  TRIANGLE],  2.0] 

[[BLUE,  TRIANGLE],  [MEDIUM,  BLUE,  TRIANGLE],  2.0] 


RULES  BEFORE  PROMOTION 


Total  number  of  Rules  =  2 


[[GREEN,  TRIANGLE],  [SMALL,  GREEN,  TRIANGLE],  0.9] 
[[BLUE,  TRIANGLE],  [SMALL,  BLUE,  TRIANGLE],  1.0] 
****************************************************** 


RULES  AFTER  PROMOTION 


Total  number  of  Rules  =  1 


[[COLOR,  TRIANGLE],  [SMALL,  COLOR,  TRIANGLE],  1.425] 

****************************************************** 


Figure  4-14:  IQA  Log  File  Segment  -  Rule  Promotion 


However, two rules now exist that are again candidates for promotion. IQA takes care of this situation on the next search. In systems with larger rule sets, it is computationally more efficient to perform this check once after document classification than to search through the rule sets iteratively. Any subsequent search would yield the results shown in Figure 4-16. Note that the terms [TRIANGLE] and [SQUARE] have the same semantic parent of [SHAPE] and are therefore suitable for promotion (generalization), and IQA does promote them in the subsequent search.


>>>>> RULES BEFORE PROMOTION


Total  number  of  Rules  =  3 


[[COLOR,  TRIANGLE],  [SMALL,  COLOR,  TRIANGLE],  1.1542499] 

[[RED,  SQUARE],  [SMALL,  RED,  SQUARE],  0.9] 

[[GREEN,  SQUARE],  [SMALL,  GREEN,  SQUARE],  1.0] 

****************************************************** 


RULES  AFTER  PROMOTION 


Total  number  of  Rules  =  2 


[[COLOR,  TRIANGLE],  [SMALL,  COLOR,  TRIANGLE],  1.1542499] 
[[COLOR, SQUARE], [SMALL, COLOR, SQUARE], 1.425]
****************************************************** 

Figure  4-15:  IQA  Log  File  Segment  -  Generalize  Rule  Promotion  Before 


4.7.2  Rule Assimilation

Rule assimilation is necessary when IQA builds a more specialized rule using FOIL, but a more general rule already exists. Consider the state of the database in Figure 4-13. A search for [RED TRIANGLE] yielded a pseudo-generalized rule from the semantic parent of the term [RED]. However, classifying [SMALL RED TRIANGLE] as positive creates the rule [RED TRIANGLE] → [SMALL RED TRIANGLE] as shown in Figure 4-17. The relevance score for [SMALL RED TRIANGLE] indicates that IQA creates a pseudo-generalized rule during the search.

Rule assimilation compares this rule to the rules at each of the parent terms to see if there is one suitable for absorption. The process of assimilation deletes the sibling rule and increments the rule weight of the parent rule per the promotion rules. The creation of this sibling rule is akin to a positive classification of the parent rule. In a sense, rule assimilation is a form of rule promotion and therefore occurs in the same manner that rule promotion does.


Total  number  of  unclassified  results  =  24 


[[RED,  TRIANGLE],  [SMALL,  RED,  TRIANGLE],  5.7851562] 
[[RED,  TRIANGLE],  [MEDIUM,  RED,  TRIANGLE],  2.0] 
[[RED,  TRIANGLE],  [BIG,  RED,  TRIANGLE],  2.0] 

[snip] 


RULES  BEFORE  PROMOTION 


Total  number  of  Rules  =  2 


[[COLOR,  TRIANGLE],  [SMALL,  COLOR,  TRIANGLE],  1.2824999] 

[[RED,  TRIANGLE],  [SMALL,  RED,  TRIANGLE],  1.0] 

****************************************************** 


RULES  AFTER  PROMOTION 


Total  number  of  Rules  =  1 


[[COLOR,  TRIANGLE],  [SMALL,  COLOR,  TRIANGLE],  2.1374998] 
****************************************************** 


Figure  4-17:  IQA  Log  File  Segment  -  Rule  Assimilation 


4.8  Rule Aging

Rule aging is necessary to remove rules that have lost their usefulness. These rules have either gone unused for an extended period, repeatedly been classified as negative, or some combination of both. This provides an automatic way to remove useless rules. The rule-aging limit value is set at 0.015 per the constraints identified in Section 3.5.4.3. This ensures that a rule ages out after 40 searches if it is never classified, after 7 searches if it is classified as negative each time, and somewhere in between for a combination of the two.

Consider Figure 4-7. The tester searches again for [BIG CIRCLE], yet this time classifies [BIG RED CIRCLE] as positive and [BIG GREEN CIRCLE] as negative. This cycle repeats 6 times until IQA deletes the original rule as shown in Figure 4-18. Once the rule weight of any rule has dropped below the usefulness level, IQA deletes the rule.


Total  number  of  Rules  =  2 


[[BIG,  CIRCLE],  [GREEN,  BIG,  CIRCLE],  0.5] 

[[BIG,  CIRCLE],  [RED,  BIG,  CIRCLE],  1.0] 
****************************************************** 


Total  number  of  Rules  =  2 


[[BIG,  CIRCLE],  [GREEN,  BIG,  CIRCLE],  0.25] 

[[BIG,  CIRCLE],  [RED,  BIG,  CIRCLE],  1.5] 
****************************************************** 

[snip  cycles  2-5] 


RULES  BEFORE  PROMOTION 


Total  number  of  Rules  =  2 


[[BIG,  CIRCLE],  [GREEN,  BIG,  CIRCLE],  0.0078125] 

[[BIG,  CIRCLE],  [RED,  BIG,  CIRCLE],  11.390625] 
****************************************************** 


RULES  AFTER  PROMOTION 


Total  number  of  Rules  =  1 


[[BIG,  CIRCLE],  [RED,  BIG,  CIRCLE],  11.390625] 


Figure  4-18:  IQA  Log  File  Segment  -  Rule  Aging 
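The aging thresholds quoted above can be reproduced with a minimal sketch. It assumes a rule starts at weight 1.0, is multiplied by 0.9 after each search in which it is not classified (the factor seen in Figure 4-8), is multiplied by 0.5 after each negative classification (the factor seen in Figure 4-18), and is deleted once its weight falls below the 0.015 limit. The function name and structure are hypothetical.

```python
AGING_LIMIT = 0.015  # rule-aging limit value quoted in the text


def searches_until_aged(factor, start=1.0, limit=AGING_LIMIT):
    """Count the searches needed for a weight to drop below the aging limit."""
    weight = start
    searches = 0
    while weight >= limit:
        weight *= factor  # apply one search's worth of decay
        searches += 1
    return searches


print(searches_until_aged(0.9))  # rule never classified
print(searches_until_aged(0.5))  # rule classified as negative every time
```

Under these assumptions the two calls give 40 and 7 searches respectively, matching the figures stated for the 0.015 limit.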


4.9  IQA versus Document Count Testing

The following automated tests use both the ‘Shape Dataset’ and the ‘NAIC Dataset’ to quantify the capabilities of IQA. These tests provide a comparison of IQA’s ability to order documents correctly based upon the rules it has learned. IQA also stores a raw document counter for documents positively classified. The raw document counter keeps track of a value for each document and represents how users classify a document over a period of searches. A positive classification causes the document count to increment by 1, while a negative classification causes a decrement of 1. Non-applicable classifications have no effect. This raw document counter and the relevance weight are the instrumental tools of this test.

IQA collects the order in which it returns documents in two ways. IQA searches for documents and sorts first by relevance weight and then by document count. IQA automatically classifies each document based on two sets of probabilities. The first probability set consists of terms that force IQA to classify a document as positive. The second set consists of terms that force IQA to classify a document as negative. IQA classifies the remaining documents as non-applicable.

IQA re-sorts the documents by classification, and compares the classified result order to the original order presented. The difference in those orders is stored as a correctness percentage. The plot and analysis of this data provide the running means (averages) of correctness for both types of document sort, and allow for a direct comparison of the two approaches.

4.9.1  Probability Sets

Probability sets are only used during automated testing. The probability sets consist of terms, probabilities, and Boolean values that indicate the document classification. These probability sets are referred to as “roulette wheels.” There are three wheels used for each test, and they are stored in text files for ease of automation. Figure 4-19 shows an example of each file that represents a separate roulette wheel.

Search Terms
2,RED,TRIANGLE,0.50,TRUE
2,SMALL,TRIANGLE,0.80,TRUE
2,MEDIUM,TRIANGLE,1.00,TRUE

Positive Terms
1,TRIANGLE,0.60,TRUE
1,RED,1.00,TRUE

Negative Terms
1,SQUARE,0.50,FALSE
1,CIRCLE,1.00,FALSE

Figure 4-19: Probability Set Search and Classification Wheel Example


Each line in a file is a record. The record starts with a number that identifies the number of terms that follow. The terms are next, followed by a probability. Each record covers the range from any value greater than the probability of the previous record (or 0 if it is the first record) up to its own probability. The final value in a record is the Boolean value that determines the document classification. A value of [TRUE] results in a positive classification, while [FALSE] results in a negative. Negative classifications have priority over positive ones.

The roulette wheels are used in randomly selecting a term or set of terms for an action. The concept of “spinning” a roulette wheel refers to generating a probability used in an IQA search. IQA spins each wheel once for a single automated search. In the example in Figure 4-19, the search terms [RED TRIANGLE] will return any document containing the term [RED]. There is a 20% probability that IQA will classify the document [BIG RED CIRCLE] as positive, so we say that this probability can represent human error in an automated test.
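The record format and the spin can be sketched as below. This is an illustrative reading of the file layout in Figure 4-19, not IQA's parser: the function names are hypothetical, and the sketch assumes that the probability column is cumulative and that a spin selects the first record whose probability is at least the generated value.

```python
import random


def parse_wheel(lines):
    """Parse records of the form 'count,term1,...,termN,probability,BOOLEAN'."""
    wheel = []
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        count = int(fields[0])
        terms = tuple(fields[1:1 + count])
        upper = float(fields[1 + count])      # cumulative upper bound of this record
        positive = fields[2 + count] == "TRUE"
        wheel.append((terms, upper, positive))
    return wheel


def select(wheel, p):
    """Return the record whose probability range contains p."""
    for terms, upper, positive in wheel:
        if p <= upper:
            return terms, positive
    raise ValueError("p exceeds the last cumulative probability")


def spin(wheel, rng=random):
    """Spin the wheel once using a uniform random value in [0, 1)."""
    return select(wheel, rng.random())


search_wheel = parse_wheel([
    "2,RED,TRIANGLE,0.50,TRUE",
    "2,SMALL,TRIANGLE,0.80,TRUE",
    "2,MEDIUM,TRIANGLE,1.00,TRUE",
])
positive_wheel = parse_wheel(["1,TRIANGLE,0.60,TRUE", "1,RED,1.00,TRUE"])
negative_wheel = parse_wheel(["1,SQUARE,0.50,FALSE", "1,CIRCLE,1.00,FALSE"])

# Chance that [BIG RED CIRCLE] is classified positive: the positive wheel must
# land on RED and the negative wheel must not land on CIRCLE.
p_red = positive_wheel[1][1] - positive_wheel[0][1]
p_not_circle = negative_wheel[0][1]
print(round(p_red * p_not_circle, 2))
```

Because negative classifications take priority, the document is classified positive only when the negative wheel lands on [SQUARE]; under the assumptions above this yields the 20% figure quoted in the text.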

We also introduce the concept of “binding”, referring to the number of terms in each positive and negative file. In Figure 4-19, there are four different search terms. Two of those terms are positively bound by one term, while two other terms are negatively bound. If we search for documents with the term [TRIANGLE], then logically we do not want any documents with [CIRCLE] or [SQUARE] classified as positive.

4.9.2  Automated Testing

A tester manually entered search terms and classifications for previous tests, but now automated testing gathers data for analysis. IQA uses a number of Boolean switches to transition between manual and automated testing. IQA also has the ability to repeat a test multiple times with different roulette wheel spins.

Each test begins with no rules and two sets of roulette wheels. The two sets represent a series of possible searches that encourage rule generalization as testing goes back and forth between them. The only difference between the two sets is that one term is swapped with another that is a δ of 2 away. For example, if the term [SQUARE] is replaced with [TRIANGLE], then the term [TRIANGLE] is also replaced with [SQUARE]. This swap assists in inducing generalization, and gives a more realistic scenario for testing.

4.9.3  ‘Shape  Dataset’  Tests 

The tester conducts both two-term and three-term tests with the ‘Shape Dataset.’ Each term test set uses the same roulette wheels, less the one changed term to induce generalization. The two-term tests bind 1 term for positive classification, while the three-term tests bind 1 and then 2 terms for positive classification. Each test begins with a randomized set of metadata.

4.9.3.1  2-Term-1 ‘Shape Dataset’ Search

The first test uses the roulette wheels in Figure 4-19 with a negative wheel spin. IQA runs for 50 iterations, with a term swap of [SQUARE] and [TRIANGLE] every 5 iterations. The metadata is randomized and the test is repeated 4 times. Figure 4-20 shows the results with a 95% confidence interval.

[Plot omitted: x-axis Cycles; series IQA Mean and Doc Mean]

Figure 4-20: Shape Test, 2-Term

The figure shows the IQA mean accuracy ahead of the document mean for almost the entire test. As IQA generalizes every 5th iteration, its accuracy continues. At iteration 50 IQA has a 12.59% better accuracy rate than the raw document count. Changing terms every five iterations causes the raw document count to stabilize near 67%. This test learns six rules and generalizes each rule by one term.

Table 4-4 shows the results from five tests. Raw data shows that IQA performs 5.28% more accurately on average, and outperforms the raw document count in four of five tests. However, the confidence intervals overlap, so the two methods are statistically the same; the intervals separate only minutely at a confidence level of 74%.

Table 4-4: Shape Test 2-Term-1

Shape Test   IQA Mean   Doc Mean   Rules Learned   Terms Generalized
2 Term 1     71.15%     65.11%     6.00            6.00
             81.23%     68.74%     6.00            6.00
             75.71%     75.79%     4.00            4.00
             82.01%     74.24%     4.00            4.00
             68.41%     68.21%     8.00            7.00
Mean         75.70%     70.42%     5.60            5.40
StDev        6.01%      4.45%      1.67            1.34

Confidence   IQA Confidence    Doc Confidence    Mean Diff   Overlap (-)
Level        Interval Width    Interval Width
95.00%       5.27%             3.90%             5.29%       -3.88%
90.00%       4.42%             3.28%             5.29%       -2.41%
80.00%       3.44%             2.55%             5.29%       -0.71%
75.00%       3.09%             2.29%             5.29%       -0.10%
74.00%       3.03%             2.24%             5.29%       0.02%
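The interval widths in Table 4-4 are consistent with a normal-approximation half-width of z × s / √n over the five tests, with the overlap column equal to the mean difference minus the sum of the two widths. The sketch below is a reconstruction based on that observation, not a description of the original analysis tool; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def interval_width(stdev, n, level):
    """Half-width of a two-sided normal confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return z * stdev / sqrt(n)


def overlap(mean_diff, width_a, width_b):
    """Negative when the two intervals overlap, positive when they separate."""
    return mean_diff - (width_a + width_b)


# Values from Table 4-4 (percentages), n = 5 tests, 95% confidence level.
iqa_width = interval_width(6.01, 5, 0.95)
doc_width = interval_width(4.45, 5, 0.95)
print(round(iqa_width, 2), round(doc_width, 2),
      round(overlap(5.29, iqa_width, doc_width), 2))
```

A negative overlap value means the two intervals still intersect at that confidence level, which is why the 2-Term-1 results are treated as statistically equivalent until the level drops to 74%.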

4.9.3.2  3-Term-X ‘Shape Dataset’ Searches

The next test uses the roulette wheels in Figure 4-21. The same conditions apply for these tests as they did for the two-term tests. The first of these uses one term to classify documents as positive.

Search Terms
3,BIG,BLUE,TRIANGLE,0.50,TRUE
3,MEDIUM,BLUE,TRIANGLE,1.00,TRUE

Positive Terms
1,TRIANGLE,0.60,TRUE
1,BLUE,1.00,TRUE

Negative Terms
1,SQUARE,0.40,FALSE
1,CIRCLE,0.80,FALSE
1,SMALL,1.00,FALSE

Figure 4-21: Probability Set Search and Classification Wheel, Three Term Shape Search


Figure 4-22 shows one of the three-term test results with a 95% confidence interval. Both IQA and the document count outperformed the two-term tests. However, IQA learns rules quickly and generalizes them effectively. The document count accuracy drops considerably at the first generalization point, and does so again at the second. IQA has an accuracy rate 8.18% greater than the raw document count at 50 iterations. This test learns four rules, and all four rules have one generalized term.

[Plot omitted: x-axis Cycles]

Figure 4-22: Shape Test, 3-Term-1

Table 4-5 shows the results for all 5 tests. IQA’s combined results are somewhat better, relative to the raw document count, than in the previous test. The confidence intervals show, at a 95% confidence level, that IQA will outperform the raw document count by just over 3%, and IQA remains slightly better at a 99% confidence level. Coincidentally, IQA learns the identical number of rules and generalizes the same number of terms as it did in the 2-Term-1 test.

The next test is also a three-term test, using the classification wheel terms in the previous three-term tests. This time an additional term is placed in the positive classification wheel to increase the positive binding. The standard 50 iterations are run with a term swap every five iterations.

Table 4-5: Shape Test 3-Term-1

Shape Test   IQA Mean   Doc Mean   Rules Learned   Terms Generalized
3 Term 1     83.87%     75.31%     6.00            6.00
             82.98%     73.30%     6.00            6.00
             89.36%     81.19%     4.00            4.00
             83.20%     76.31%     4.00            4.00
             90.50%     78.36%     8.00            7.00
Mean         85.98%     76.89%     5.60            5.40
StDev        3.64%      3.01%      1.67            1.34

Confidence   IQA Confidence    Doc Confidence    Mean Diff   Overlap (-)
Level        Interval Width    Interval Width
99.00%       4.19%             3.47%             9.09%       1.42%
95.00%       3.19%             2.64%             9.09%       3.26%
90.00%       2.68%             2.22%             9.09%       4.19%
75.00%       1.87%             1.55%             9.09%       5.67%
50.00%       1.10%             0.91%             9.09%       7.08%

The results in Figure 4-23 show that IQA overtakes the raw document count at 10 iterations (at the first generalization switch) and continues to outperform it throughout the test. It finishes 7.89% more accurate than the document count. Generalization switches cause an oscillation in raw document count accuracies early on.

Table 4-6 shows the combined results from all five tests. Even though IQA outperforms the document count again on four out of five tests, the confidence interval check shows these are statistically equivalent. There is only a 24% confidence level that IQA will outperform the raw document count in any way. This time IQA learns fewer rules but still performs almost as accurately as in the 3-Term-1 test. However, the standard deviation for both methods is much higher. Metadata randomization is the most likely cause, combined with the structure of the probability wheels.

[Plot omitted: y-axis Mean Accuracy (50% to 100%); x-axis Cycles; series IQA Mean and Doc Mean]

Figure 4-23: Shape Test, 3-Term-2 with stronger binding


Table 4-6: Shape Test 3-Term-2

Shape Test   IQA Mean   Doc Mean   Rules Learned   Term Generalized
3 Term 2     93.91%     94.16%     2.00            2.00
             81.04%     73.15%     4.00            4.00
             93.07%     92.53%     3.00            4.00
             72.58%     71.50%     3.00            3.00
             75.97%     70.02%     3.00            3.00
Mean         83.31%     80.27%     3.00            3.20
StDev        9.77%      12.00%     0.71            0.84

Confidence Level   IQA Confidence Interval Width   Doc Confidence Interval Width   Mean Diff   Overlap
95.00%             8.56%                           10.52%                          3.04%       -16.04%
90.00%             7.19%                           8.83%                           3.04%       -12.97%
80.00%             5.60%                           6.88%                           3.04%       -9.43%
75.00%             5.03%                           6.17%                           3.04%       -8.15%
24.00%             1.33%                           1.64%                           3.04%       0.07%
4.9.4 ‘NAIC Dataset’ Tests


This data set is used with a 4-Term-1 test and a 4-Term-3 test. The tester performs
these identically to the ‘Shape Dataset’ tests with one exception: generalization switches
occur every 10 iterations and the tests run for 100 iterations.

4.9.4.1 4-Term-A ‘NAIC Dataset’ Searches


We select the roulette wheels as shown in Figure 4-24 for the first test. This test
uses the negative wheel spin with four search terms and one bound positive term.


Search Terms
4, CLOSE, FRONT, GROUND, MIG-21, 0.50, TRUE
4, MEDIUM, DISTANT, INFLIGHT, MIG-21, 0.80, TRUE
4, CLOSE, FRONT, PARTIAL, MIG-21, 1.00, TRUE

Positive Terms
1, MIG-21, 0.50, TRUE
1, FRONT, 0.80, TRUE
1, CLOSE, 1.00, TRUE

Negative Terms
1, MIRAGE 2000, 0.50, FALSE
1, MIG-21BIS, 1.00, FALSE

Figure 4-24: Probability Set Search and Classification Wheel, Four Term NAIC Search
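
To illustrate how such a wheel can drive an automated test, the following Python sketch spins the search wheel of Figure 4-24. It is a sketch under stated assumptions, not the tester's code: it reads the numeric field of each entry (0.50, 0.80, 1.00) as a cumulative probability threshold, and the name `spin` is hypothetical.

```python
import random

# Search wheel from Figure 4-24: (search terms, assumed cumulative threshold).
SEARCH_WHEEL = [
    (("CLOSE", "FRONT", "GROUND", "MIG-21"), 0.50),
    (("MEDIUM", "DISTANT", "INFLIGHT", "MIG-21"), 0.80),
    (("CLOSE", "FRONT", "PARTIAL", "MIG-21"), 1.00),
]


def spin(wheel, r=None):
    """Return the first entry whose cumulative threshold is at least r."""
    if r is None:
        r = random.random()  # uniform draw in [0, 1)
    for terms, threshold in wheel:
        if r <= threshold:
            return terms
    return wheel[-1][0]  # fallback if the last threshold is below 1.0


print(spin(SEARCH_WHEEL, 0.65))
```

With this reading, the first search is issued about half of the time, the second about 30% of the time, and the third about 20% of the time.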


Figure 4-25 shows a 4-Term-1 search test. Both IQA and the raw document count
start very accurate. This is attributed to the random way the metadata is sorted. Some
sorts present documents in a distribution that closely resembles the way the roulette
wheels provide search terms. The document count oscillates with each generalization
switch. However, the IQA data becomes less accurate after the 29th iteration. IQA could
be overfitting the NAIC data by learning too many rules early on. Both counts stabilize
after 70 iterations, but the raw document count ends slightly more accurate than IQA
(by 1.58%).

Table 4-7 shows the results of all five tests. The raw document count outperformed
IQA in each of the five tests. A 95% confidence interval check shows these tests are
statistically equivalent. Much of IQA’s poor performance can be attributed to extreme
overfitting. Only three different sets of search terms are on each roulette wheel. For six
possible searches, IQA learned more than 36 rules per test run on average (Table 4-7).

Figure 4-25: ‘NAIC Dataset’, 4-Term-1 (plot of mean accuracy versus cycles for IQA Mean and Doc Mean)


Table 4-7: NAIC Test 4-Term-1

NAIC Test    IQA Mean   Doc Mean   Rules Learned   Term Generalized
4 Term 1     79.16%     80.88%     38.00           1.00
             79.62%     80.68%     35.00           3.00
             78.02%     80.15%     40.00           5.00
             80.69%     82.27%     37.00           3.00
             79.83%     83.13%     32.00           5.00
Mean         79.47%     81.42%     36.40           3.40
StDev        0.98%      1.24%      3.05            1.67

Alpha   IQA Confidence Interval Width   Doc Confidence Interval Width   Mean Diff   Overlap (-)
0.05    0.86%                           1.08%                           -1.96%      -3.90%
0.10    0.72%                           0.91%                           -1.96%      -3.59%
0.20    0.56%                           0.71%                           -1.96%      -3.23%
0.25    0.50%                           0.64%                           -1.96%      -3.10%
0.50    0.30%                           0.37%                           -1.96%      -2.62%
The positive classification roulette wheels are set with three positive terms for the
following tests. Figure 4-26 shows the settings.


Search Terms
4, CLOSE, FRONT, GROUND, MIG-21, 0.50, TRUE
4, MEDIUM, DISTANT, INFLIGHT, MIG-21, 0.80, TRUE
4, CLOSE, FRONT, PARTIAL, MIG-21, 1.00, TRUE

Positive Terms
3, CLOSE, FRONT, MIG-21, 0.50, TRUE
3, MEDIUM, FRONT, MIG-21, 0.80, TRUE
3, CLOSE, PARTIAL, MIG-21, 1.00, TRUE

Negative Terms
1, MIRAGE 2000, 0.50, FALSE
1, MIG-21BIS, 1.00, FALSE

Figure 4-26: Probability Set Search and Classification Wheel, 4-Term-3 NAIC Search


Figure 4-27 shows a 4-Term-3 test. IQA begins with a much better accuracy due
to the metadata randomization and the strength of early rule learning. Both methods
quickly improve over the first 15 iterations. Generalization switches push the raw
document count accuracies down, and they stabilize around 84-85%. IQA accuracy
outperforms the raw document count throughout the entire test, and ends up better by
4.46% after 100 iterations.

Table 4-8 shows the results of all five tests. Metadata randomization affected
early mean accuracies for both sets of 4-Term tests. These tests are the most
consistently accurate and stable of any attempted. IQA outperforms the raw
document count in all five tests. A 95% confidence interval test shows IQA performs
better by almost 2%, and is consistently better than the raw document count at each
confidence interval measurement point.
Figure 4-27: ‘NAIC Dataset’, 4-Term-3


Table 4-8: NAIC Test 4-Term-3

NAIC Test    IQA Mean   Doc Mean   Rules Learned   Term Generalized
4 Term 3     88.19%     86.61%     30.00           6.00
             89.43%     85.72%     27.00           4.00
             88.29%     85.79%     27.00           7.00
             88.89%     84.43%     23.00           6.00
             88.10%     84.39%     26.00           4.00
Mean         88.58%     85.39%     26.60           5.40
StDev        0.57%      0.96%      2.51            1.34

Confidence Level   IQA Confidence Interval Width   Doc Confidence Interval Width   Mean Diff   Overlap
99.00%             0.65%                           1.11%                           3.19%       1.43%
95.00%             0.50%                           0.84%                           3.19%       1.85%
90.00%             0.42%                           0.71%                           3.19%       2.06%
75.00%             0.29%                           0.49%                           3.19%       2.40%
50.00%             0.17%                           0.29%                           3.19%       2.73%
4.10 Summary

This chapter discusses each of the objectives of the IQA system and gives a detailed
discussion of each of its functions. This analysis establishes IQA’s ability to
learn rules and then generalize those rules across a defined semantic tree. Automated
tests compare IQA’s abilities against a simple document count sort.
The semantic structure of the data as well as the initial order of documents influences the
initial performance of any search method. Appendix D shows graphs of all tests. The
final chapter presents the research conclusions as well as recommendations for future
work.
5. Conclusions and Recommendations


5.1  Introduction 

This chapter summarizes the Intelligent Query Answering (IQA) research and
research objectives. The first section discusses the impact of combining rule learning and
rule generalization. The next section reviews the objectives of this research and draws
conclusions regarding the efficacy of IQA. The final portion presents an outline of some
potential follow-on areas of study.

5.2  Research  Impact 

The need for better methods of effectively searching large datasets drives this
research. The first objective is the construction of a rule learning system that returns
documents sorted by rule weights. The second objective is to build a system that
generalizes those rules to future searches. The results of this study provide an
innovative method for returning relevant user-requested documents by learning rules
based on the way a user classifies returned documents.

5.2.1  Rule  Learning 

Rule learning consists of a modified implementation of FOIL combined with a
Winnow algorithm used for updating rule weights. IQA learns rules based upon search
terms and user classification of the returned documents. It returns the documents sorted
by relevance weights, which are computed from rule hits, rule weights and search term
hits. The rule weights adjust over time based on rule use and on user classifications of
returned documents that have associated rules.

Analysis of the modified FOIL algorithm shows IQA can learn rules under all
tested search conditions. The Winnow algorithm initializes rule weights for all new rules,
and correctly updates rule weights on subsequent classifications of returned documents.
Document searches compute the relevance weight accurately and successfully sort all
returned documents based on this weight.
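
A minimal Python sketch of this weight update and sort is given here for illustration. It is not the IQA code. The promotion and demotion factor, the initial weight, and the relevance formula (search term hits plus the weights of the rules that fire) are all assumptions, since this chapter states only that relevance depends on rule hits, rule weights and search term hits. All names are hypothetical.

```python
ALPHA = 2.0           # assumed Winnow promotion/demotion factor
INITIAL_WEIGHT = 1.0  # assumed weight given to a newly learned rule


def update_weight(weight, relevant):
    """Winnow-style multiplicative update for one user classification."""
    return weight * ALPHA if relevant else weight / ALPHA


def relevance(doc_terms, search_terms, rules):
    """Assumed score: search term hits plus weights of rules that fire."""
    term_hits = len(doc_terms & search_terms)
    rule_score = sum(w for terms, w in rules.items() if terms <= doc_terms)
    return term_hits + rule_score


key = frozenset({"RED", "CIRCLE"})
rules = {key: INITIAL_WEIGHT}
# The user classifies a document matched by the rule as relevant.
rules[key] = update_weight(rules[key], relevant=True)

docs = {
    "d1": {"RED", "CIRCLE", "BIG"},
    "d2": {"RED", "SQUARE", "BIG"},
}
search = {"RED", "BIG"}
ranked = sorted(docs, key=lambda d: relevance(docs[d], search, rules),
                reverse=True)
print(ranked)
```

Both documents match the two search terms, so a raw term count cannot separate them. The learned rule moves `d1` ahead of `d2`, which is the effect the rule weights are intended to have on document order.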

The complexity of the learned rules varied from one to three terms throughout the
tests. However, IQA only learned three-term rules through manual testing with the
‘Shape Dataset’. All automated tests generated rules of one or two terms. FOIL’s greedy
approach in learning rules keeps the number of rule terms to those that return positively
classified documents. It is in this way that FOIL can learn two or more rules for a single
search.

5.2.2 Rule Generalization

Rule generalization consists of pseudo generalization, rule promotion, rule
absorption and rule aging. IQA uses existing rules and creates new pseudo rules by
generalizing semantically similar terms. Pseudo rules provide an additional set of
documents when rules for the existing search terms do not exist. This function assists the
user in finding the most relevant documents based upon previous searches (learned rules).

Tests show that IQA’s pseudo rule generalization assists in finding documents.
The semantic distance threshold (δ) successfully limits rule generalization. When rules
exist for the search terms, any documents returned by generalized rules have the expected
lower relevance value than documents returned by existing rules. However, a generalized
rule can still have a remarkably large rule weight if the semantically similar rule had a
large rule weight. Testing both scenarios ensures rule weights and pseudo-generalized
rule weights are correctly computed from the original rule weight, δ, and the
number of existing rules.

Tests of rule promotion show that IQA creates a new rule from instances of
existing rules when it finds two or more matching rules that are separated only by a term
within δ. The new rule is a generalized (promoted) version of the previous rules. The
promotion algorithm also correctly computes the generalized rule weight based on the
existing rules that matched, and then deletes those matching rules.
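
Rule promotion can be sketched in Python as follows. This is an illustrative sketch rather than the IQA algorithm. It assumes rules are sets of terms, that matching rules differ only in one term drawn from sibling nodes of the semantic tree, that promotion requires at least two matching rules covering a majority of the parent's children, and that the promoted weight is the mean of the matched weights. The weight formula in particular is an assumption, and the function name is hypothetical.

```python
# Parent -> children fragment of the 'Shape Dataset' semantic tree (Appendix A).
CHILDREN = {"COLOR": ["RED", "BLUE", "GREEN"]}


def promote(rules, parent):
    """Replace sibling-specific rules with one rule on the parent term."""
    kids = CHILDREN[parent]
    groups = {}  # remaining terms -> rules that differ only in the sibling term
    for terms in rules:
        for kid in kids:
            if kid in terms:
                groups.setdefault(terms - {kid}, []).append(terms)
    promoted = dict(rules)
    for rest, matches in groups.items():
        matches = [m for m in matches if m in promoted]
        if len(matches) >= 2 and len(matches) > len(kids) / 2:
            weights = [promoted.pop(m) for m in matches]  # delete matched rules
            # Assumption: generalized weight is the mean of the matched weights.
            promoted[rest | {parent}] = sum(weights) / len(weights)
    return promoted


rules = {
    frozenset({"RED", "CIRCLE"}): 0.5,
    frozenset({"BLUE", "CIRCLE"}): 1.0,
    frozenset({"GREEN", "SQUARE"}): 0.4,
}
result = promote(rules, "COLOR")
print(result)
```

Here the two CIRCLE rules are replaced by a single rule on COLOR and CIRCLE, while the lone SQUARE rule is left unchanged because only one of the three COLOR children has a matching rule.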

Rule  absorption  and  rule  aging  delete  rules  that  add  no  value  to  IQA.  Rule 
absorption  occurs  when  a  more  general  form  of  the  rule  exists.  Rule  aging  occurs  when 
the  candidate  rule  weight  is  less  than  0.0125.  Each  function  works  as  tested. 
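
The two deletion functions can be illustrated with a short Python sketch. Only the aging threshold of 0.0125 comes from the text. The test for "a more general form of the rule" is an assumption (a rule of the same length in which one or more terms are replaced by their direct parent in the semantic tree), and the names are hypothetical.

```python
AGING_THRESHOLD = 0.0125  # rule aging threshold stated in the text

# Child -> parent fragment of the 'Shape Dataset' semantic tree (Appendix A).
PARENT = {"RED": "COLOR", "BLUE": "COLOR", "GREEN": "COLOR"}


def is_more_general(general, specific):
    """True if `general` is `specific` with some terms replaced by parents."""
    if general == specific or len(general) != len(specific):
        return False
    covers = all(t in general or PARENT.get(t) in general for t in specific)
    justified = all(
        g in specific or any(PARENT.get(t) == g for t in specific)
        for g in general
    )
    return covers and justified


def prune(rules):
    """Drop aged rules, then rules absorbed by a more general survivor."""
    alive = {r: w for r, w in rules.items() if w >= AGING_THRESHOLD}
    return {r: w for r, w in alive.items()
            if not any(is_more_general(g, r) for g in alive)}


rules = {
    frozenset({"COLOR", "CIRCLE"}): 0.75,
    frozenset({"RED", "CIRCLE"}): 0.5,    # absorbed by the COLOR rule
    frozenset({"BIG", "SQUARE"}): 0.01,   # aged out: weight below threshold
}
kept = prune(rules)
print(kept)
```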

5.3 Automated Testing

IQA  runs  through  a  series  of  automated  tests  designed  to  simulate  a  live  user 
searching  the  system  and  compare  the  accuracy  of  returned  document  order.  Tests  use 
both  test  data  sets  and  are  completed  using  a  negative  classification  wheel  spin  (to 
simulate  user  error)  and  again  with  no  negative  classification  (to  simulate  perfect  user 
classification.) 

Table 5-1 shows the complete ‘Shape Dataset’ results. IQA outperforms the raw
document count in every simulation. IQA’s document return order is more accurate than
that of the raw document count by more than 9% in some instances.
Table 5-1: ‘Shape Dataset’ Complete Results

Shape Test   IQA Mean   Doc Mean   Rules Learned   Term Generalized
2 Term 1     71.15%     65.11%     6.00            6.00
             81.23%     68.74%     6.00            6.00
             75.71%     75.79%     4.00            4.00
             82.01%     74.24%     4.00            4.00
             68.41%     68.21%     8.00            7.00
Mean         75.70%     70.42%     5.60            5.40
StDev        6.01%      4.45%      1.67            1.34

Shape Test   IQA Mean   Doc Mean   Rules Learned   Term Generalized
3 Term 1     83.87%     75.31%     6.00            6.00
             82.98%     73.30%     6.00            6.00
             89.36%     81.19%     4.00            4.00
             83.20%     76.31%     4.00            4.00
             90.50%     78.36%     8.00            7.00
Mean         85.98%     76.89%     5.60            5.40
StDev        3.64%      3.01%      1.67            1.34

Shape Test   IQA Mean   Doc Mean   Rules Learned   Term Generalized
3 Term 2     93.91%     94.16%     2.00            2.00
             81.04%     73.15%     4.00            4.00
             93.07%     92.53%     3.00            4.00
             72.58%     71.50%     3.00            3.00
             75.97%     70.02%     3.00            3.00
Mean         83.31%     80.27%     3.00            3.20
StDev        9.77%      12.00%     0.71            0.84

Table 5-2 shows the confidence intervals for the ‘Shape Dataset’ tests. While the
individual tests show IQA is more accurate, the confidence tests reveal this is
consistently true only in the 3-Term-1 test. In the 2-Term-1 test, there is only a 74%
confidence that IQA will be more accurate than the raw document count. That confidence
drops to 24% with the 3-Term-2 test. Therefore, these two tests are statistically
equivalent.
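
The 74% and 24% figures can be approximated directly from the means and standard deviations in Table 5-1. The Python sketch below is illustrative only. It assumes normal-approximation intervals over five runs and solves for the confidence level at which the two intervals just touch, that is, where the mean difference equals the sum of the two interval widths. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def separation_confidence(mean_diff, stdev_a, stdev_b, n):
    """Two-sided confidence level at which the two intervals just touch."""
    z = mean_diff * sqrt(n) / (stdev_a + stdev_b)
    return 2 * NormalDist().cdf(z) - 1


# Mean difference and standard deviations (percentage points), five runs each.
two_term_1 = separation_confidence(5.29, 6.01, 4.45, 5)
three_term_2 = separation_confidence(3.04, 9.77, 12.00, 5)
print(f"{two_term_1:.3f} {three_term_2:.3f}")
```

Under these assumptions the two values come out near 0.74 and 0.245, in line with the 74% and 24% confidence levels reported for the 2-Term-1 and 3-Term-2 tests.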
Table 5-2: ‘Shape Dataset’ Confidence Intervals

Shape Test 2-Term-1
Alpha   IQA Mean   Doc Mean   Mean Diff   Overlap    Confidence
0.05    5.27%      3.90%      5.29%       -3.88%
0.10    4.42%      3.28%      5.29%       -2.41%
0.20    3.44%      2.55%      5.29%       -0.71%
0.25    3.09%      2.29%      5.29%       -0.10%
0.26    3.03%      2.24%      5.29%       0.02%      74.00%

Shape Test 3-Term-1
Alpha   IQA Mean   Doc Mean   Mean Diff   Overlap    Confidence
0.01    4.19%      3.47%      9.09%       1.42%      99.00%
0.05    3.19%      2.64%      9.09%       3.26%      95.00%
0.10    2.68%      2.22%      9.09%       4.19%      90.00%
0.25    1.87%      1.55%      9.09%       5.67%      75.00%
0.50    1.10%      0.91%      9.09%       7.08%      50.00%

Shape Test 3-Term-2
Alpha   IQA Mean   Doc Mean   Mean Diff   Overlap    Confidence
0.05    8.56%      10.52%     3.04%       -16.04%
0.10    7.19%      8.83%      3.04%       -12.97%
0.20    5.60%      6.88%      3.04%       -9.43%
0.25    5.03%      6.17%      3.04%       -8.15%
0.76    1.33%      1.64%      3.04%       0.07%      24.00%

Table 5-3 shows the complete ‘NAIC Dataset’ test results. The raw document
count outperformed IQA with a loosely bound search (1 positive term). However, IQA
outperformed the raw document count on a tightly positively bound search. This could be
due to the structure of the ‘NAIC Dataset’. The NAIC data is very structured in the sense
that the first few terms are always positional [CLOSE RIGHT FRONT GROUND...]
followed by an object [MIG-21], followed by information on the country markings
[WITH POLISH MARKINGS]. IQA makes no assumptions about this data and therefore
uses none of this background knowledge to learn and generalize rules better.
The number of generalized terms is lower in the ‘NAIC Dataset’ tests than in the ‘Shape
Dataset’ tests. This result is expected, since a term can only be generalized
if a majority of child nodes under the same parent have similar rules. The automated
testing exploited only a few of these possible generalizations, and this in turn could have
an effect on the overall IQA accuracy results.


Table 5-3: ‘NAIC Dataset’ Complete Results

NAIC Test    IQA Mean   Doc Mean   Rules Learned   Term Generalized
4 Term 1     79.16%     80.88%     38.00           1.00
             79.62%     80.68%     35.00           3.00
             78.02%     80.15%     40.00           5.00
             80.69%     82.27%     37.00           3.00
             79.83%     83.13%     32.00           5.00
Mean         79.47%     81.42%     36.40           3.40
StDev        0.98%      1.24%      3.05            1.67

NAIC Test    IQA Mean   Doc Mean   Rules Learned   Term Generalized
4 Term 3     88.19%     86.61%     30.00           6.00
             89.43%     85.72%     27.00           4.00
             88.29%     85.79%     27.00           7.00
             88.89%     84.43%     23.00           6.00
             88.10%     84.39%     26.00           4.00
Mean         88.58%     85.39%     26.60           5.40
StDev        0.57%      0.96%      2.51            1.34



Table 5-4 shows the confidence intervals for the ‘NAIC Dataset’ tests. The results
are inconsistent between the two tests. The 4-Term-1 test shows that the IQA and raw
document counts are statistically equivalent, while the 4-Term-3 test shows IQA
performs a bit better than the raw document count.

Overall, IQA outperformed the raw document count in four out of five tests.
Confidence intervals show that in three of these tests the results are statistically the same,
while in the other two IQA performs somewhat better than the raw document count.
Manual testing shows the effectiveness of rule learning and generalization, while the
automated testing confirms that it is no worse than a raw document count in all instances.
In some scenarios, IQA outperforms the raw document count.


Table 5-4: ‘NAIC Dataset’ Confidence Intervals

NAIC Test 4-Term-1
Alpha   IQA Mean   Doc Mean   Mean Diff   Overlap    Confidence
0.05    0.86%      1.08%      -1.96%      -3.90%
0.10    0.72%      0.91%      -1.96%      -3.59%
0.20    0.56%      0.71%      -1.96%      -3.23%
0.25    0.50%      0.64%      -1.96%      -3.10%
0.50    0.30%      0.37%      -1.96%      -2.62%

NAIC Test 4-Term-3
Alpha   IQA Mean   Doc Mean   Mean Diff   Overlap    Confidence
0.01    0.65%      1.11%      3.19%       1.43%      99.00%
0.10    0.42%      0.71%      3.19%       2.06%      90.00%
0.20    0.33%      0.55%      3.19%       2.31%      80.00%
0.25    0.29%      0.49%      3.19%       2.40%      75.00%
0.50    0.17%      0.29%      3.19%       2.73%      50.00%

5.4 Outline of Future Work

This thesis provides a fundamental look at the concepts of blending rule learning
and rule generalization, as well as providing the foundation for future research in this
area. The concepts behind IQA have relevance in almost any database or web search
application. Those concepts could improve the ability of an individual or group of
individuals to get relevant results from a large dataset after only a few search iterations.
This section also discusses a more robust implementation using group rule sets, a semantic
tree builder/learner, database implementation, and a WordNet plug-in.
5.4.1 Local versus Group Rule Pseudo Generalization

IQA learns rules for one user. It can be expanded to give a new user access to a
global set of strong rules to pseudo generalize against for a limited time,
or until a threshold of personal rules is established. This global rule set would provide
an initial basis for a user to search a large data structure, and has the potential for
providing relevant documents more effectively than individual rule learning.

5.4.2  Semantic  Tree  Builder 

The  semantic  tree  forms  the  basis  for  rule  generalization.  The  process  of  manually 
building  an  effective  semantic  tree  is  tedious.  An  alternative  is  the  development  of  a 
semantic  tree  builder  that  builds  a  list  of  all  terms  from  the  database,  and  then  allows  the 
user  to  build  the  tree  interactively.  Another  method  would  be  to  learn  the  semantic  tree 
structure  dynamically  from  user  input. 

5.4.3  Database  Implementation 

For the scope of this research, the Java data structures were adequate to support
the implementation and test of IQA at a rudimentary level. To implement these concepts
on a larger scale, the rule learning and generalization aspects need incorporation into an
ODBC compliant structure such as SQL-Server or Oracle to support large data sets. The
NAIC IEC system has been the subject of many research efforts [BAK03, WIL03 and
SPL04] and it is possible that one of these approaches could benefit from integration with
IQA.
5.4.4 WordNet Interface


One of the challenges of this research is building the semantic tree. WordNet provided
the  concepts  for  IQA’s  semantic  tree  and  aided  with  term  placement.  However,  to 
incorporate  WordNet  into  IQA  would  have  required  a  complete  rewrite  of  both 
applications.  Still,  future  research  to  incorporate  WordNet  into  an  open  structure  would 
be  valuable.  Such  a  plug  and  play  structure  would  allow  future  machine  learning 
approaches  to  searching  databases  using  rule  generalization  without  continuously 
reinventing  the  wheel. 

5.5 Summary

This  research  examines  the  problem  of  querying  a  database  with  large  amounts  of 
information  and  returning  only  the  most  relevant  records  to  the  user.  Specifically,  the 
problem  is  the  effective  retrieval  of  desired  records  in  a  relevant  order  without  a 
tremendous  amount  of  data  preprocessing.  The  research  integrates  one  popular  approach, 
rule  learning  using  FOIL,  with  the  more  obscure  concept  of  semantic  generalization.  The 
result  is  a  rule  learning  system  that  can  also  generalize  rules  across  a  predefined  semantic 
tree.  This  research  provides  a  first  look  at  this  novel  combination  and  demonstrates  the 
capabilities  against  two  sets  of  data.  It  also  provides  ideas  for  future  studies  to  extend 
these  concepts. 
Appendix A - Semantic Trees


1.  Raw  ‘Shape  Dataset’  Semantic  Data 

WORD, null, NOUN, ADJECTIVE

NOUN, WORD, SHAPE 

ADJECTIVE, WORD, ATTRIBUTE 

SHAPE, NOUN, CIRCLE, SQUARE, TRIANGLE 

CIRCLE, SHAPE, null 

SQUARE, SHAPE, null 

TRIANGLE, SHAPE, null 

ATTRIBUTE, ADJECTIVE, SIZE, COLOR 

SIZE, ATTRIBUTE, SMALL, MEDIUM, BIG 

SMALL, SIZE, null 

MEDIUM, SIZE, null 

BIG, SIZE, null 

COLOR, ATTRIBUTE, RED, BLUE, GREEN 
RED, COLOR, null 
BLUE, COLOR, null 
GREEN, COLOR, null 
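
Each raw record above reads as a node, its parent, and its children (with null marking a missing parent or child). The Python sketch below is an illustration only, not the IQA loader. It parses these records and computes a semantic distance between two terms. The distance metric (number of edges between the terms through their nearest common ancestor) is an assumption, and the function names are hypothetical.

```python
RAW = """\
WORD, null, NOUN, ADJECTIVE
NOUN, WORD, SHAPE
ADJECTIVE, WORD, ATTRIBUTE
SHAPE, NOUN, CIRCLE, SQUARE, TRIANGLE
CIRCLE, SHAPE, null
SQUARE, SHAPE, null
TRIANGLE, SHAPE, null
ATTRIBUTE, ADJECTIVE, SIZE, COLOR
SIZE, ATTRIBUTE, SMALL, MEDIUM, BIG
SMALL, SIZE, null
MEDIUM, SIZE, null
BIG, SIZE, null
COLOR, ATTRIBUTE, RED, BLUE, GREEN
RED, COLOR, null
BLUE, COLOR, null
GREEN, COLOR, null
"""


def parse(raw):
    """Map each node to its parent; 'null' marks the root."""
    parent = {}
    for line in raw.strip().splitlines():
        node, par, *_children = [field.strip() for field in line.split(",")]
        parent[node] = None if par == "null" else par
    return parent


def path_to_root(node, parent):
    """List the node followed by each ancestor up to the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def distance(a, b, parent):
    """Assumed metric: edges between a and b via their nearest common ancestor."""
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    common = next(n for n in up_a if n in up_b)
    return up_a.index(common) + up_b.index(common)


tree = parse(RAW)
print(distance("RED", "BLUE", tree), distance("RED", "SMALL", tree))
```

With this metric, sibling terms such as RED and BLUE are two edges apart, while RED and SMALL are four edges apart, so a distance threshold would admit generalization between colors before it admits generalization between a color and a size.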

2.  Raw  ‘NAIC  Dataset’  Semantic  Data 

WORD, null, OBJECT, DESCRIPTOR 

OBJECT, WORD, AIRCRAFT, SPACE_VEHICLE, GUIDED_MISSILES 
AIRCRAFT, OBJECT, MANNED, UAV 

MANNED, AIRCRAFT, A-47, A5C, A-5III, AN-26, ALPHA_JET, AN-12, AN-124, AN-140, AN-225, AN-225, AN-26, AN-3, AN-32, AN-71, AN-74, AN-74-300, AN-74T-200, AN-74TK-200, ANTONOV, B707, C-160, CANBERRA, CHEETAH-C, CHEETAH-D, DORNIER-228, EUROFIGHTER, TYPHOON, F-15, F-16, F-2, F-4, F-6, F-7MF, F-7MG, F-8IIACT, F-8IIM, FB-7, FBC-1, FC-1, FT-7PG, FTC-2000, HAL, JAGUAR, JAS-39, K-8, KA-28, KA-52, KMH, KT-1, L15, L-159, L-39, LCA, MB-326, MI-17-V5, MIG-21, MIG-27M, MIG-29, MIRAGE_2000, MIRAGE_F1CR, MIRAGE_F1CT, PC-12, RAFALE, SU-22, SU-22M4, SU-24MK, SU-27FLANKER, SU-30MK, SU-32, SU-33, SU-39, ETENDARD, MK-III, T-50, TORNADO, TU-334, XXJ, Y7H-500, Y8F400, YAK-130, Z-8, Z-9G

A-47, MANNED, DAKOTA
DAKOTA, A-47, null
A5C, MANNED, null
A-5III, MANNED, FANTANS
FANTANS, A-5III, null
AN-26, MANNED, null
ALPHA_JET, MANNED, null
AN-12, MANNED, CUB
CUB, AN-12, null
AN-124, MANNED, CONDOR
CONDOR, AN-124, null
AN-140, MANNED, null
AN-225, MANNED, null
AN-225, MANNED, COSSACK
COSSACK, AN-225, null
AN-26, MANNED, CURL
CURL, AN-26, null
AN-3,  MANNED, null 

AN-32,  MANNED, CLINE, AN-32B, AN-32P 

CLINE, AN-32,  null 

AN-32B, AN-32,  null 

AN-32P, AN-32,  null 

AN-71,  MANNED, null 

AN-74, MANNED, null
AN-74-300, MANNED, null
AN-74T-200, MANNED, null
AN-74TK-200, MANNED, null
ANTONOV, MANNED, null
B707, MANNED, COMINT, COMJAM
COMINT, B707, null
COMJAM, B707, null

C-160, MANNED, GABRIEL 

GABRIEL, C-160, null 

CANBERRA, MANNED, null 

CHEETAH-C, MANNED, null 

CHEETAH-D, MANNED, null 

DORNIER-228, MANNED, null 

EUROFIGHTER, MANNED, null 

TYPHOON, MANNED, null 

F-15, MANNED, null 

F-16, MANNED, F-16A 

F-16A, F-16, null 

F-2, MANNED, null 

F-4, MANNED, PHANTOM

PHANTOM, F-4, null 

F-6, MANNED, null 

F-7MF, MANNED, null 

F-7MG, MANNED, null 

F-8IIACT, MANNED, null
F-8IIM, MANNED, null
FB-7, MANNED, null
FBC-1, MANNED, null

FC-1, MANNED, null 

FT-7PG, MANNED, null 

FTC-2000, MANNED, null

HAL, MANNED, null 

JAGUAR, MANNED, null 

JAS-39, MANNED, GRIPEN, JAS-39A

GRIPEN, JAS-39, null 

JAS-39A, JAS-39, null 

K-8, MANNED, null 

KA-28, MANNED, KAMOV
KAMOV, KA-28, null

KA-52, MANNED, ALLIGATOR 

ALLIGATOR, KA-52, null 

KMH, MANNED, null 

KT-1, MANNED, WOONGBEE

WOONGBEE, KT-1, null 

L15, MANNED, null 

L-159, MANNED, null 

L-39, MANNED, null 




LCA, MANNED, LCA-NAVY 
LCA-NAVY, LCA, null 
MB-326, MANNED, IMPALA 
IMPALA, MB-326, null
MI-17-V5, MANNED, null 

MIG-21, MANNED, MIG-21BIS, FISHBED, MIG-21-93, MIG-21MF, MIG-21UM

BIS, MIG-21, null 

FISHBED, MIG-21,  null 

MIG-21-93,  MIG-21,  null 

MIG-21BIS, MIG-21,  null 

MIG-21MF, MIG-21,  null 

MIG-21UM, MIG-21,  null 

MIG-27M, MANNED, null 

MIG-29, MANNED, null
MIRAGE_2000, MANNED, MIRAGE_2000-5F, MIRAGE_2000D
MIRAGE_2000-5F, MIRAGE_2000, null
MIRAGE_2000D, MIRAGE_2000, null
MIRAGE_F1CR, MANNED, null
MIRAGE_F1CT, MANNED, null
PC-12, MANNED, null
RAFALE, MANNED, RAFALE_B, RAFALE_B-01, RAFALE_M, RAFALE_M-02, RAFALE_M9
RAFALE_B, RAFALE, null
RAFALE_B-01, RAFALE, null
RAFALE_M, RAFALE, null
RAFALE_M-02, RAFALE, null
RAFALE_M9, RAFALE, null

SU-22, MANNED, SU-22M3 

SU-22M3, SU-22, null 

SU-22M4, MANNED, FITTER-K

FITTER-K, SU-22M4, null 

SU-24MK, MANNED, null 

SU-27FLANKER, MANNED, FLANKER, SU-27SK, FLANKER-B 

FLANKER, SU-27, null 

SU-27SK, SU-27, null 

FLANKER-B, SU-27, null 

SU-30MK, MANNED, null 

SU-32, MANNED, null 

SU-33, MANNED, null 

SU-39, MANNED, null
ETENDARD, MANNED, null
MK-III, MANNED, SUPERHIND
SUPERHIND, MK-III, null

T-50, MANNED, null 

TORNADO, MANNED, GR-4, ILS
GR-4, TORNADO, null

ILS, TORNADO, null 

TU-334, MANNED, null 

XXJ, MANNED, null 

Y7H-500, MANNED, null
Y8F400, MANNED, null
YAK-130, MANNED, null

Z-8, MANNED, null 

Z-9G, MANNED, null 

UAV, AIRCRAFT, 350ENGINE, AERONEF, AEROSKY, AEROSONDE, AW-4, BREZEL/KZO, C22, CHUNGSHYANG-II, CKIG, EAGLE, FOXAT3, FOXMLCS, HERMES, MINI-V, HERMES1500, HERMES180, HW-02, LARK, MICRO, NISHANT, PETITDUC, PHANTOMEYE, REMEZ-3, S-100, S-45, S-70, SA-6, SAABSHARC, SCOUT-II, SEAMOS, SEEKER, SEEKER-II, SHADOW-200, SHARC, SKUA, TAILSITTER, VULTURE, W-50, WZ-2000

350ENGINE, UAV, null 

AERONEF, UAV, null 

AEROSKY, UAV, null 

AEROSONDE, UAV, null 

AW-4, UAV, SHARK-II

SHARK-II, AW-4, null 

BREZEL/KZO, UAV, null 

C22, UAV, null 

CHUNGSHYANG-II, UAV, null 

CKIG, UAV, null 

EAGLE, UAV, null 

FOXAT3, UAV, null 

FOXMLCS, UAV, null 

HERMES, UAV, null 

MINI-V, UAV, null 

HERMES1500, UAV, null 

HERMES180, UAV, null

HW-02, UAV, null 

LARK, UAV, null 

MICRO, UAV, null 

NISHANT, UAV, null 

PETITDUC, UAV, null 

PHANTOMEYE, UAV, null 

REMEZ-3, UAV, null 

S-100, UAV, null 

S-45, UAV, null 

S-70, UAV, null 

SA-6, UAV, null 

SAABSHARC, UAV, null 

SCOUT-II, UAV, null 

SEAMOS, UAV, null 

SEEKER, UAV, null 

SEEKER-II, UAV, null 

SHADOW-200, UAV, null

SHARC, UAV, null 

SKUA, UAV, null 

TAILSITTER, UAV, null 

VULTURE, UAV, null 

W-50, UAV, null 

WZ-2000, UAV, null 

GUIDED_MISSILES, OBJECT, LGB, AA-10, AA-11, AA-12, AA-8, A-DARTER, R-DARTER, UMKHONTO-IR, AGM-65, APS-784, AS-10, AS-11, AS-12, AS-15B, AS-17, KRYPTON, AS-18, AS-20, KAYAK, ASTER-30, BRAHMOS, C-301, C-701, C-802, CINGOZ, CKIG, CROTALE, DZ-88, FLG-1, FLS-1, FLV-1, FM-90N, FN-6, HJ-8, HN-5, HQ-2B, 11-2000, INGWE, I-TALD, JLl, KEPD-350, KH-59MK, KS-IA, LY-60, MAGIC-2, METIS-M, MICA, MK-80, MM-2000, MOKOPA, MOSKIT-E, PAVEWAY-III, MK2, PL-5E, PL-9C, TY-90, QW-1, QW-3, QW-2, QW-3, QW-3, QW-4, RAPTOR-I, RAPTOR-II, SA-10, SA-16, SA-6, SAHV-3, SAHV-IR, SAMOC, STORM, SHADOW/SCALPEG, TIENCHIEN-II, TY-90, PL-5E, PL-9C, UA-424, UMKHONTO, UMKHONTO-IR, A-DARTER, WS-1
LGB, GUIDED_MISSILES, null
AA-10, GUIDED_MISSILES, ALAMO
ALAMO, AA-10, null
AA-11, GUIDED_MISSILES, null
AA-12, GUIDED_MISSILES, null
AA-8, GUIDED_MISSILES, null
A-DARTER, GUIDED_MISSILES, null
R-DARTER, GUIDED_MISSILES, null
UMKHONTO-IR, GUIDED_MISSILES, null

AGM-65, GUIDED_MISSILES, AGM-65D, AGM-65E, AGM-65B, AGM-65F 

AGM-65D, AGM-65, null 

AGM-65E, AGM-65, null 

AGM-65B, AGM-65, null 

AGM-65F, AGM-65, null 

APS-784, GUIDED_MISSILES, null 

AS-10, GUIDED_MISSILES, KAREN 

KAREN, AS-10, null 

AS-11, GUIDED_MISSILES, KILTER 

KILTER, AS-11, null 

AS-12, GUIDED_MISSILES, KEGLER 

KEGLER, AS-12, null 

AS-15B, GUIDED_MISSILES, KENT 

KENT, AS-15B, null 

AS-17, GUIDED_MISSILES, KRYPTON 

KRYPTON, GUIDED_MISSILES, null 

AS-18, GUIDED_MISSILES, KAZOO, AS-18M 

KAZOO, AS-18, null 

AS-18M, AS-18, null 

AS-20, GUIDED_MISSILES, KAYAK 

KAYAK, AS-20, null 

ASTER-30, GUIDED_MISSILES, null 

BRAHMOS, GUIDED_MISSILES, null 

C-301, GUIDED_MISSILES, null

C-701, GUIDED_MISSILES, null

C-802, GUIDED_MISSILES, null 

CINGOZ, GUIDED_MISSILES, null 

CKIG, GUIDED_MISSILES, null 

CROTALE, GUIDED_MISSILES, null 

DZ-88, GUIDED_MISSILES, null 

FLG-1, GUIDED_MISSILES, null 

FLS-1, GUIDED_MISSILES, null 

FLV-1, GUIDED_MISSILES, null 

FM-90N, GUIDED_MISSILES, null 

FN-6, GUIDED_MISSILES, null 

HJ-8, GUIDED_MISSILES, RED_ARROW 

RED_ARROW, HJ-8, null 

HN-5, GUIDED_MISSILES, null 

HQ-2B, GUIDED_MISSILES, null 

11-2000, GUIDED_MISSILES, null

INGWE, GUIDED_MISSILES, null 

I-TALD, GUIDED_MISSILES, null 

JLl, GUIDED_MISSILES, null 

KEPD-350, GUIDED_MISSILES, null 

KH-59MK, GUIDED_MISSILES, null 

KS-IA, GUIDED_MISSILES, null 

LY-60, GUIDED_MISSILES, null 

MAGIC-2, GUIDED_MISSILES, null
METIS-M, GUIDED_MISSILES, null
MICA, GUIDED_MISSILES, null 

MK-80, GUIDED_MISSILES, null 
MM-2000, GUIDED_MISSILES, null 
MOKOPA, GUIDED_MISSILES, null 
MOSKIT-E, GUIDED_MISSILES, null 
PAVEWAY-III, GUIDED_MISSILES, null 
MK2, GUIDED_MISSILES, PENGUIN 
PENGUIN, MK2, null 
PL-5E, GUIDED_MISSILES, null 
PL-9C, GUIDED_MISSILES, null 
TY-90, GUIDED_MISSILES, null 
QW-1, GUIDED_MISSILES, QW-1A, QW-1M
QW-1A, QW-1, null
QW-1M, QW-1, null
QW-3, GUIDED_MISSILES, null 
QW-2, GUIDED_MISSILES, null 
QW-3, GUIDED_MISSILES, null 
QW-3, GUIDED_MISSILES, null 
QW-4, GUIDED_MISSILES, null 
RAPTOR-I, GUIDED_MISSILES, null 
RAPTOR-II, GUIDED_MISSILES, null 
SA-10, GUIDED_MISSILES, null 
SA-16, GUIDED_MISSILES, GIMLET 
GIMLET, SA-16, null 
SA-6, GUIDED_MISSILES, null 
SAHV-3, GUIDED_MISSILES, null 
SAHV-IR, GUIDED_MISSILES, null 
SAMOC, GUIDED_MISSILES, null 
STORM, GUIDED_MISSILES, null 

SHADOW/SCALPEG, GUIDED_MISSILES, null
TIENCHIEN-II, GUIDED_MISSILES, null 
TY-90, GUIDED_MISSILES, null 
PL-5E, GUIDED_MISSILES, null 
PL-9C, GUIDED_MISSILES, null 
UA-424, GUIDED_MISSILES, null 
UMKHONTO, GUIDED_MISSILES, null 
UMKHONTO-IR, GUIDED_MISSILES, null
A-DARTER, GUIDED_MISSILES, null
WS-1, GUIDED_MISSILES, WS-1B
WS-1B, WS-1, null

SPACE_VEHICLE, OBJECT, CYCLONE-4, ZENIT-3SL, CZ-2E, CZ-2F, CZ-3A, CZ-3B, CZ-3C, DFH-1, HANGTIAN_TSINGHUA-1, HY-1, LM-2C/SD, LM-2E, LM-3A, LM-3B, KT-1, KT-2, SHENZHOU-1

CYCLONE-4, SPACE_VEHICLE, null 
ZENIT-3SL, SPACE_VEHICLE, null 
CZ-2E, SPACE_VEHICLE, LONGMARCH 
LONGMARCH, CZ-2E, null 
CZ-2F, SPACE_VEHICLE, null 
CZ-3A, SPACE_VEHICLE, null 
CZ-3B, SPACE_VEHICLE, null 
CZ-3C, SPACE_VEHICLE, null 
DFH-1, SPACE_VEHICLE, null
HANGTIAN_TSINGHUA-1, SPACE_VEHICLE, null

HY-1, SPACE_VEHICLE, OCEAN 

OCEAN, HY-1, null

LM-2C/SD, SPACE_VEHICLE, null 

LM-2E, SPACE_VEHICLE, null 

LM-3A, SPACE_VEHICLE, null 

LM-3B, SPACE_VEHICLE, null 

KT-1, SPACE_VEHICLE, PIONEER-1 

PIONEER-1, KT-1, null 

KT-2, SPACE_VEHICLE, PIONEER-2, PIONEER-2A
PIONEER-2, KT-2, null

PIONEER-2A, KT-2, null
SHENZHOU-1, SPACE_VEHICLE, null 

DESCRIPTOR, WORD, DISTANCE, POSITION, LOCATION, VIEW, MARKINGS 
CLOSE, DISTANCE, null 
MEDIUM, DISTANCE, null 
DISTANT, DISTANCE, null

DISTANCE, DESCRIPTOR, CLOSE, MEDIUM, DISTANT
POSITION, DESCRIPTOR, LEFT, RIGHT 
LEFT, POSITION, null 
RIGHT, POSITION, null 

LOCATION, DESCRIPTOR, FRONT, REAR, INTERIOR, SIDE, UNDERSIDE, OVERHEAD, AMOUNT 

FRONT, LOCATION, null 

REAR, LOCATION, null 

INTERIOR, LOCATION, null 

SIDE, LOCATION, null 

UNDERSIDE, LOCATION, null 

OVERHEAD, LOCATION, null 

VIEW, DESCRIPTOR, GROUND, INFLIGHT 

GROUND, VIEW, null 

INFLIGHT, VIEW, null 

AMOUNT, DESCRIPTOR, PARTIAL, FULL 

PARTIAL, AMOUNT, null 

FULL, AMOUNT, null 

MARKINGS, DESCRIPTOR, PAKISTANI, UKRAINIAN, PERUVIAN, AFRICAN, UK, US, KOREAN, JAPANESE, TURKISH, FRENCH, SWEDISH, RUSSIAN, YEMEN, INDIAN, CZECH, POLISH, YUGOSLAVIAN, BANGLADESH, UGANDAN, CHINESE, GERMAN, RSAF

PAKISTANI, MARKINGS, null 

UKRAINIAN, MARKINGS, null 

PERUVIAN, MARKINGS, null 

AFRICAN, MARKINGS, null 

UK, MARKINGS, null 

US, MARKINGS, null 

KOREAN, MARKINGS, null

JAPANESE, MARKINGS, null 

TURKISH, MARKINGS, null 

FRENCH, MARKINGS, null 

SWEDISH, MARKINGS, null 

RUSSIAN, MARKINGS, null 

YEMEN, MARKINGS, null 

INDIAN, MARKINGS, null 

CZECH, MARKINGS, null 

POLISH, MARKINGS, null 

YUGOSLAVIAN, MARKINGS, null 
BANGLADESH, MARKINGS, null
UGANDAN, MARKINGS, null 
CHINESE, MARKINGS, null 
GERMAN, MARKINGS, null 
RSAF, MARKINGS, null 
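
The entries above appear to follow the pattern "node, parent, children", with null marking a node that has no children. The following is a minimal Python sketch of how such records could be loaded and walked upward for rule generalization. It assumes that record layout; the function names parse_semantic_tree and ancestors are hypothetical and are not taken from the IQA implementation.

```python
def parse_semantic_tree(lines):
    """Parse 'NODE, PARENT, CHILD, ...' records into parent and children maps."""
    parent, children = {}, {}
    for raw in lines:
        # Drop empty fields so stray commas (", , null") do not create blank names.
        fields = [f.strip() for f in raw.split(",") if f.strip()]
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        node, par, kids = fields[0], fields[1], fields[2:]
        parent[node] = par
        children[node] = [k for k in kids if k.lower() != "null"]
    return parent, children


def ancestors(node, parent):
    """Return the chain of ancestors, nearest first, guarding against cycles."""
    chain, seen = [], {node}
    while node in parent and parent[node] not in seen:
        node = parent[node]
        chain.append(node)
        seen.add(node)
    return chain


sample = [
    "AGM-65, GUIDED_MISSILES, AGM-65D, AGM-65E, AGM-65B, AGM-65F",
    "AGM-65D, AGM-65, null",
    "HJ-8, GUIDED_MISSILES, RED_ARROW",
    "RED_ARROW, HJ-8, null",
]
parent, children = parse_semantic_tree(sample)
print(ancestors("AGM-65D", parent))  # ['AGM-65', 'GUIDED_MISSILES']
print(children["HJ-8"])              # ['RED_ARROW']
```

Walking the ancestor chain is one way a rule learned for AGM-65D could be generalized to AGM-65 or to GUIDED_MISSILES as a whole.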
Appendix B - Winnow Rule Aging Calculations


Subsequent  Initial         Initial
Iterations  Classification  Classification  1 Pos    2 Pos    3 Pos    4 Pos    5 Pos    6 Pos
            1.00000         1.00000         1.50000  2.25000  3.37500  5.06250  7.59375  11.39063
1           0.90000         0.50000¹        1.35000  2.02500  1.68750  2.53125  3.79688  5.69531
2           0.81000         0.25000         1.21500  1.82250  0.84375  1.26563  1.89844  2.84766
3           0.72900         0.12500         1.09350  1.64025  0.75938  0.63281  0.94922  1.42383
4           0.65610         0.06250         0.98415  1.47623  0.68344  0.31641  0.47461  0.71191
5           0.59049         0.03125         0.88574  1.32860  0.61509  0.28477  0.23730  0.35596
6           0.53144         0.01563         0.79716  1.19574  0.55358  0.25629  0.11865  0.17798
7           0.47830         0.00781         0.71745  1.07617  0.49823  0.23066  0.10679  0.08899
8           0.43047         0.00703         0.64570  0.96855  0.44840  0.20759  0.09611  0.04449
9           0.38742         0.00633         0.58113  0.87170  0.40356  0.18683  0.08650  0.04005
10-17
18          0.15009         0.00245         0.22514  0.33771  0.15635  0.07238  0.03351  0.01551
19          0.13509         0.00221         0.20263  0.30394  0.14071  0.06515  0.03016  0.01396
20          0.12158         0.00199         0.18236  0.27355  0.12664  0.05863  0.02714  0.01257
21          0.10942         0.00179         0.16413  0.24619  0.11398  0.05277  0.02443  0.01131
22          0.09848         0.00161         0.14772  0.22157  0.10258  0.04749  0.02199  0.01018
23          0.08863         0.00145         0.13294  0.19942  0.09232  0.04274  0.01979  0.00916
24          0.07977         0.00130         0.11965  0.17947  0.08309  0.03847  0.01781  0.00824
25          0.07179         0.00117         0.10768  0.16153  0.07478  0.03462  0.01603  0.00742
26          0.06461         0.00106         0.09692  0.14537  0.06730  0.03116  0.01443  0.00668
27          0.05815         0.00095         0.08722  0.13084  0.06057  0.02804  0.01298  0.00601
28          0.05233         0.00085         0.07850  0.11775  0.05452  0.02524  0.01168  0.00541
29          0.04710         0.00077         0.07065  0.10598  0.04906  0.02271  0.01052  0.00487
30          0.04239         0.00069         0.06359  0.09538  0.04416  0.02044  0.00946  0.00438
31          0.03815         0.00062         0.05723  0.08584  0.03974  0.01840  0.00852  0.00394
32          0.03434         0.00056         0.05151  0.07726  0.03577  0.01656  0.00767  0.00355
33          0.03090         0.00050         0.04635  0.06953  0.03219  0.01490  0.00690  0.00319
34          0.02781         0.00045         0.04172  0.06258  0.02897  0.01341  0.00621  0.00287
35          0.02503         0.00041         0.03755  0.05632  0.02607  0.01207  0.00559  0.00259
36          0.02253         0.00037         0.03379  0.05069  0.02347  0.01086  0.00503  0.00233
37          0.02028         0.00033         0.03041  0.04562  0.02112  0.00978  0.00453  0.00210
38          0.01825         0.00030         0.02737  0.04106  0.01901  0.00880  0.00407  0.00189
39          0.01642         0.00027         0.02463  0.03695  0.01711  0.00792  0.00367  0.00170
40          0.01478²        0.00024         0.02217  0.03326  0.01540  0.00713  0.00330  0.00153
41          0.01330         0.00022         0.01995  0.02993  0.01386  0.00642  0.00297  0.00138
42          0.01197         0.00020         0.01796  0.02694  0.01247  0.00577  0.00267  0.00124
43          0.01078         0.00018         0.01616  0.02424  0.01122  0.00520  0.00241  0.00111
44          0.00970         0.00016         0.01455  0.02182  0.01010  0.00468  0.00217  0.00100
45          0.00873         0.00014         0.01309  0.01964  0.00909  0.00421  0.00195  0.00090
46          0.00786         0.00013         0.01178  0.01767  0.00818  0.00379  0.00175  0.00081
47          0.00707         0.00012         0.01060  0.01591  0.00736  0.00341  0.00158  0.00073
48          0.00636         0.00010         0.00954  0.01432  0.00663  0.00307  0.00142  0.00066

¹ Items in bold italics indicate negative classification, and subsequent factoring of 0.5.
² Items in bold indicate non-classification, and subsequent factoring of 0.9.
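
The aging arithmetic behind the table can be sketched in a few lines of Python. The factors 0.5 and 0.9 come from the two notes above. The promotion factor of 1.5 is an assumption inferred from the starting row (1.0, 1.5, 2.25, ... are successive powers of 1.5), and the function name weight_history is hypothetical. The event sequence shown reproduces the first rows of the "3 Pos" column, where the order of negative and non-classified steps is inferred from the ratios between successive entries.

```python
PROMOTE = 1.5  # assumed: starting-row values are powers of 1.5
DEMOTE = 0.5   # negative classification (first table note)
AGE = 0.9      # non-classification (second table note)

FACTORS = {"pos": PROMOTE, "neg": DEMOTE, "none": AGE}


def weight_history(events, start=1.0):
    """Return the rule weight after each event, beginning with the start weight."""
    weights = [start]
    for event in events:
        weights.append(weights[-1] * FACTORS[event])
    return weights


# "3 Pos" column: three positive classifications, then two negative
# classifications, then two iterations in which the rule is not classified.
events = ["pos"] * 3 + ["neg"] * 2 + ["none"] * 2
for step, weight in enumerate(weight_history(events)):
    print(step, round(weight, 5))
```

Because every update is multiplicative, a rule that is never classified decays geometrically: after n unclassified iterations its weight is the starting weight times 0.9 to the power n, which matches the first value column of the table.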




Appendix C - Log File: IQA Tests Sections 4.4 - 4.5


This  file  contains  the  comparison  data  for  testing. 


>>>>>  Search  Results  <<<<< 


Total  number  of  search  results  =  24 


[[SMALL, TRIANGLE], [BLUE, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, GREEN, SQUARE], 1.0]
[[SMALL, TRIANGLE], [SMALL, CIRCLE], 1.0]
[[SMALL, TRIANGLE], [BIG, GREEN, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [RED, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [BIG, BLUE, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, GREEN, CIRCLE], 1.0]
[[SMALL, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, SQUARE], 1.0]
[[SMALL, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [BIG, RED, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, RED, CIRCLE], 1.0]
[[SMALL, TRIANGLE], [TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [BIG, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, BLUE, CIRCLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, BLUE, TRIANGLE], 2.0]
[[SMALL, TRIANGLE], [MEDIUM, RED, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [MEDIUM, BLUE, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, TRIANGLE], 2.0]
[[SMALL, TRIANGLE], [SMALL, RED, TRIANGLE], 2.0]
[[SMALL, TRIANGLE], [SMALL, BLUE, SQUARE], 1.0]
[[SMALL, TRIANGLE], [SMALL, GREEN, TRIANGLE], 2.0]
[[SMALL, TRIANGLE], [SMALL, RED, SQUARE], 1.0]


>>>>> UnclassifiedResults <<<<<


Total number of unclassified results = 24


[[SMALL, TRIANGLE], [SMALL, BLUE, TRIANGLE], 2.0]
[[SMALL, TRIANGLE], [SMALL, TRIANGLE], 2.0]
[[SMALL, TRIANGLE], [SMALL, RED, TRIANGLE], 2.0]
[[SMALL, TRIANGLE], [SMALL, GREEN, TRIANGLE], 2.0]
[[SMALL, TRIANGLE], [RED, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [BIG, BLUE, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, GREEN, CIRCLE], 1.0]
[[SMALL, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, SQUARE], 1.0]
[[SMALL, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [BIG, RED, TRIANGLE], 1.0]
[[SMALL, TRIANGLE], [SMALL, RED, CIRCLE], 1.0]
[[SMALL, TRIANGLE], [TRIANGLE], 1.0]
>>>>> RULES BEFORE PROMOTION <<<<<


Total number of Rules = 1


[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 1.0]

******************************************************


>>>>> RULES AFTER PROMOTION <<<<<


Total number of Rules = 1


[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 1.0]

******************************************************


This  file  contains  the  comparison  data  for  testing. 


>>>>>  Search  Results  <<<<< 


Total  number  of  search  results  =  24 


[[RED, TRIANGLE], [BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [RED, TRIANGLE], 2.0]
[[RED, TRIANGLE], [BIG, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 2.0]
[[RED, TRIANGLE], [MEDIUM, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 2.0]
[[RED, TRIANGLE], [MEDIUM, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 2.0]
[[RED, TRIANGLE], [SMALL, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [RED, SQUARE], 1.0]


>>>>> UnclassifiedResults <<<<<


Total number of unclassified results = 24


[[RED, TRIANGLE], [RED, TRIANGLE], 2.0]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 2.0]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 2.0]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 2.0]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BLUE, TRIANGLE], 1.0]

[[RED,  TRIANGLE],  [BIG,  GREEN,  TRIANGLE],  1.0] 

[[RED,  TRIANGLE],  [RED,  TRIANGLE],  20.140625] 

[[RED,  TRIANGLE],  [BIG,  BLUE,  TRIANGLE],  1.0] 

[[RED,  TRIANGLE],  [GREEN,  TRIANGLE],  1.0] 

[[RED,  TRIANGLE],  [MEDIUM,  TRIANGLE],  1.0] 

[[RED,  TRIANGLE],  [MEDIUM,  GREEN,  TRIANGLE],  1.0] 
[[RED,  TRIANGLE],  [BIG,  RED,  CIRCLE],  1.0] 

[[RED,  TRIANGLE],  [BIG,  RED,  TRIANGLE],  20.140625] 
[[RED,  TRIANGLE],  [MEDIUM,  RED,  SQUARE],  1.0] 

[[RED,  TRIANGLE],  [SMALL,  RED,  CIRCLE],  1.0] 

[[RED,  TRIANGLE],  [TRIANGLE],  1.0] 

[[RED,  TRIANGLE],  [BIG,  TRIANGLE],  1.0] 

[[RED,  TRIANGLE],  [BIG,  RED,  SQUARE],  1.0] 

[[RED,  TRIANGLE],  [SMALL,  BLUE,  TRIANGLE],  1.0] 

[[RED,  TRIANGLE],  [RED,  CIRCLE],  1.0] 

[[RED,  TRIANGLE],  [MEDIUM,  RED,  CIRCLE],  1.0] 

[[RED,  TRIANGLE],  [MEDIUM,  RED,  TRIANGLE],  20.140625] 
[[RED,  TRIANGLE],  [MEDIUM,  BLUE,  TRIANGLE],  1.0] 
[[RED,  TRIANGLE],  [SMALL,  TRIANGLE],  1.0] 

[[RED,  TRIANGLE],  [SMALL,  RED,  TRIANGLE],  20.140625] 
[[RED,  TRIANGLE],  [SMALL,  GREEN,  TRIANGLE],  1.0] 
[[RED,  TRIANGLE],  [SMALL,  RED,  SQUARE],  1.0] 
[[RED, TRIANGLE], [RED, SQUARE], 1.0]


>>>>> UnclassifiedResults <<<<<


Total number of unclassified results = 24


[[RED, TRIANGLE], [RED, TRIANGLE], 20.140625]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 20.140625]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 20.140625]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 20.140625]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BIG, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [RED, SQUARE], 1.0]


>>>>>  Classification  Results  <<<<< 


Total  number  of  classified  results  =  24 


[[RED, TRIANGLE], [RED, TRIANGLE], 0]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 1]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 0]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 0]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 0]
[[RED, TRIANGLE], [MEDIUM, TRIANGLE], 0]
[[RED, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 0]
[[RED, TRIANGLE], [BIG, RED, CIRCLE], 0]
[[RED, TRIANGLE], [BIG, GREEN, TRIANGLE], 0]
[[RED, TRIANGLE], [MEDIUM, RED, SQUARE], 0]
[[RED, TRIANGLE], [SMALL, RED, CIRCLE], 0]
[[RED, TRIANGLE], [TRIANGLE], 0]
[[RED, TRIANGLE], [BIG, TRIANGLE], 0]
[[RED, TRIANGLE], [BIG, RED, SQUARE], 0]
[[RED, TRIANGLE], [SMALL, BLUE, TRIANGLE], 0]
[[RED, TRIANGLE], [RED, CIRCLE], 0]
[[RED, TRIANGLE], [MEDIUM, RED, CIRCLE], 0]
[[RED, TRIANGLE], [BLUE, TRIANGLE], 0]
[[RED, TRIANGLE], [MEDIUM, BLUE, TRIANGLE], 0]
[[RED, TRIANGLE], [SMALL, TRIANGLE], 0]
[[RED, TRIANGLE], [BIG, BLUE, TRIANGLE], 0]
[[RED, TRIANGLE], [SMALL, GREEN, TRIANGLE], 0]
[[RED, TRIANGLE], [SMALL, RED, SQUARE], 0]
[[RED, TRIANGLE], [RED, SQUARE], 0]


>>>>> NEW RULES <<<<<


Total number of new Rules = 3


[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 0.81]
[[RED, TRIANGLE], [RED, TRIANGLE], 3.690562]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 1.0]

******************************************************


>>>>> DOCUMENTS FOR COMPARISON TO FOIL/WINNOW <<<<<


Total number of documents in document file = 0


>>>>> RULES BEFORE PROMOTION <<<<<


Total number of Rules = 3


[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 0.81]
[[RED, TRIANGLE], [RED, TRIANGLE], 3.690562]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 1.0]

******************************************************


>>>>> RULES AFTER PROMOTION <<<<<


Total number of Rules = 3


[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 0.81]
[[RED, TRIANGLE], [RED, TRIANGLE], 3.690562]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 1.0]

******************************************************


This  file  contains  the  comparison  data  for  testing. 


>>>>>  Search  Results  <<<<< 


Total  number  of  search  results  =  24 


[[RED, TRIANGLE], [BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [RED, TRIANGLE], 23.001371]
[[RED, TRIANGLE], [BIG, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 26.001371]
[[RED, TRIANGLE], [MEDIUM, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 23.001371]
[[RED, TRIANGLE], [MEDIUM, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 23.001371]
[[RED, TRIANGLE], [SMALL, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [RED, SQUARE], 1.0]


>>>>> UnclassifiedResults <<<<<


Total number of unclassified results = 24


[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 26.001371]
[[RED, TRIANGLE], [RED, TRIANGLE], 23.001371]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 23.001371]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 23.001371]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BIG, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [RED, SQUARE], 1.0]


>>>>>  Classification  Results  <<<<< 


Total  number  of  classified  results  =  24 


[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 0]
[[RED, TRIANGLE], [RED, TRIANGLE], 0]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 1]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 0]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 0]
[[RED, TRIANGLE], [MEDIUM, TRIANGLE], 0]
[[RED, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 0]
[[RED, TRIANGLE], [BIG, RED, CIRCLE], 0]
[[RED, TRIANGLE], [BIG, GREEN, TRIANGLE], 0]
[[RED, TRIANGLE], [MEDIUM, RED, SQUARE], 0]
[[RED, TRIANGLE], [SMALL, RED, CIRCLE], 0]
[[RED, TRIANGLE], [TRIANGLE], 0]
[[RED, TRIANGLE], [BIG, TRIANGLE], 0]
[[RED, TRIANGLE], [BIG, RED, SQUARE], 0]
[[RED, TRIANGLE], [SMALL, BLUE, TRIANGLE], 0]
[[RED, TRIANGLE], [RED, CIRCLE], 0]
[[RED, TRIANGLE], [MEDIUM, RED, CIRCLE], 0]
[[RED, TRIANGLE], [BLUE, TRIANGLE], 0]
[[RED, TRIANGLE], [MEDIUM, BLUE, TRIANGLE], 0]
[[RED, TRIANGLE], [SMALL, TRIANGLE], 0]
[[RED, TRIANGLE], [BIG, BLUE, TRIANGLE], 0]
[[RED, TRIANGLE], [SMALL, GREEN, TRIANGLE], 0]
[[RED, TRIANGLE], [SMALL, RED, SQUARE], 0]
[[RED, TRIANGLE], [RED, SQUARE], 0]


>>>>> NEW RULES <<<<<


Total number of new Rules = 4


[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 0.729]
[[RED, TRIANGLE], [RED, TRIANGLE], 4.0356293]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 0.9]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 1.0]


>>>>> DOCUMENTS FOR COMPARISON TO FOIL/WINNOW <<<<<


Total number of documents in document file = 0


>>>>> RULES BEFORE PROMOTION <<<<<


Total number of Rules = 4


[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 0.729]
[[RED, TRIANGLE], [RED, TRIANGLE], 4.0356293]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 0.9]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 1.0]


>>>>> RULES AFTER PROMOTION <<<<<


Total number of Rules = 4


[[SMALL, TRIANGLE], [RED, SMALL, TRIANGLE], 0.729]
[[RED, TRIANGLE], [RED, TRIANGLE], 4.0356293]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 0.9]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 1.0]


This  file  contains  the  comparison  data  for  testing. 


>>>>>  Search  Results  <<<<< 


Total  number  of  search  results  =  24 


[[RED, TRIANGLE], [BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [RED, TRIANGLE], 26.357563]
[[RED, TRIANGLE], [BIG, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 26.531805]
[[RED, TRIANGLE], [MEDIUM, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 29.357563]
[[RED, TRIANGLE], [MEDIUM, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 26.357563]
[[RED, TRIANGLE], [SMALL, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [RED, SQUARE], 1.0]


>>>>> UnclassifiedResults <<<<<


Total number of unclassified results = 24


[[RED, TRIANGLE], [MEDIUM, RED, TRIANGLE], 29.357563]
[[RED, TRIANGLE], [BIG, RED, TRIANGLE], 26.531805]
[[RED, TRIANGLE], [RED, TRIANGLE], 26.357563]
[[RED, TRIANGLE], [SMALL, RED, TRIANGLE], 26.357563]
[[RED, TRIANGLE], [GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BIG, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [SMALL, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, RED, CIRCLE], 1.0]
[[RED, TRIANGLE], [BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [MEDIUM, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, TRIANGLE], 1.0]
[[RED, TRIANGLE], [BIG, BLUE, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, GREEN, TRIANGLE], 1.0]
[[RED, TRIANGLE], [SMALL, RED, SQUARE], 1.0]
[[RED, TRIANGLE], [RED, SQUARE], 1.0]
Appendix D - Test Graphs

[Graphs not reproduced; axis label: Mean Accuracy]